Introduction

Posing Questions

In this project we explore the data systematically, so there is no single specific question we set out to answer up front.

Data Sources

  1. Name: result.csv

Import of the needed libraries:

In [1]:
#Import important libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import zipfile
import requests
import geopy.distance
from sklearn.cluster import KMeans

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Data Gathering & Wrangling

First we need to gather the data. The code below downloads the monthly archives, unzips them, and merges them into a final *.csv file.

In [ ]:
#define filenames to download
year_data = [x for x in range(201801, 201813)] + [x for x in range(201901, 201903)]

#make sure the target folder exists (the next cells read from ./Data)
os.makedirs("./Data", exist_ok = True)

#loop over all months
for year in year_data:
    
    #url structure
    url = f"https://s3.amazonaws.com/fordgobike-data/{year}-fordgobike-tripdata.csv.zip"
    
    #get response
    response = requests.get(url)
   
    #save file into ./Data
    with open(f"./Data/{year}-fordgobike-tripdata.csv.zip", mode = "wb") as file:
        file.write(response.content)
In [ ]:
#https://stackoverflow.com/questions/3451111/unzipping-files-in-python/3451150
#define file names
files = next(os.walk("./Data"))[2]

#loop over file names
for x in files:
    if ".zip" in x:
        with zipfile.ZipFile(f"./Data/{x}",'r') as zip_ref:
            zip_ref.extractall("./Data")
In [ ]:
month_range = sorted(int(x[:6]) for x in files if ".zip" not in x) #can be exchanged with year_data .. 

#read in all monthly files
dfs = [pd.read_csv(f"./Data/{month}-fordgobike-tripdata.csv") for month in month_range]
len_df = sum(len(df) for df in dfs)

#all files must share the same columns before concatenating
for df_month in dfs[1:]:
    assert all(dfs[0].columns == df_month.columns)

#merge everything into one frame
result = pd.concat(dfs, ignore_index = True)

result.to_csv("result.csv", index = False)
print("Done")           
print(len_df == len(result))
del len_df, dfs, result, files, month_range

Data Assessing

Now that we have our final "result.csv" file, we have to make sure that the data is ready for analysis.

In [278]:
df = pd.read_csv("result.csv")
In [311]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2239215 entries, 0 to 2239214
Data columns (total 16 columns):
duration_sec               int64
start_time                 object
end_time                   object
start_station_id           float64
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    int64
user_type                  object
member_birth_year          float64
member_gender              object
bike_share_for_all_trip    object
dtypes: float64(7), int64(2), object(7)
memory usage: 273.3+ MB

Here we can already see that the start_time and end_time columns are not datetime objects. Also, the station IDs and member_birth_year are stored as floats, although they should be treated as categorical (object) rather than numeric values.
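The dtype fixes noted above could look like the following minimal sketch on a toy frame (the values mirror rows shown below; using pandas' nullable Int64 for the IDs is one option alongside plain object dtype):

```python
import pandas as pd

# Hypothetical mini-frame mirroring the relevant columns of result.csv
df = pd.DataFrame({
    "start_time": ["2018-01-31 22:52:35.2390"],
    "end_time": ["2018-02-01 19:47:19.8240"],
    "start_station_id": [120.0],
    "member_birth_year": [1986.0],
})

# Parse the timestamp strings into real datetime objects
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])

# IDs and birth years are labels, not quantities; nullable Int64
# keeps NaN support while dropping the ".0" float artifact
for col in ["start_station_id", "member_birth_year"]:
    df[col] = df[col].astype("Int64")
```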

In [312]:
df.head()
Out[312]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 75284 2018-01-31 22:52:35.2390 2018-02-01 19:47:19.8240 120.0 Mission Dolores Park 37.761420 -122.426435 285.0 Webster St at O'Farrell St 37.783521 -122.431158 2765 Subscriber 1986.0 Male No
1 85422 2018-01-31 16:13:34.3510 2018-02-01 15:57:17.3100 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 2815 Customer NaN NaN No
2 71576 2018-01-31 14:23:55.8890 2018-02-01 10:16:52.1160 304.0 Jackson St at 5th St 37.348759 -121.894798 296.0 5th St at Virginia St 37.325998 -121.877120 3039 Customer 1996.0 Male No
3 61076 2018-01-31 14:53:23.5620 2018-02-01 07:51:20.5000 75.0 Market St at Franklin St 37.773793 -122.421239 47.0 4th St at Harrison St 37.780955 -122.399749 321 Customer NaN NaN No
4 39966 2018-01-31 19:52:24.6670 2018-02-01 06:58:31.0530 74.0 Laguna St at Hayes St 37.776435 -122.426244 19.0 Post St at Kearny St 37.788975 -122.403452 617 Subscriber 1991.0 Male No
In [313]:
ms.matrix(df);

It seems that we also have a problem with missing data.
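Besides the missingno matrix, the gaps can be quantified per column; a small sketch with invented values of the same shape as the trip data:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps seen in the trip data (hypothetical values)
df = pd.DataFrame({
    "start_station_id": [120.0, np.nan, 304.0, np.nan],
    "member_gender": ["Male", None, "Male", "Female"],
})

# Absolute and relative null counts per column
null_counts = df.isnull().sum()
null_share = df.isnull().mean().round(2)
```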

In [314]:
df[df.start_station_id.isnull()].head(10)
Out[314]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
623364 5572 2018-06-30 20:58:31.8550 2018-06-30 22:31:24.5930 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.94 4202 Customer 1989.0 Male No
623376 573 2018-06-30 22:12:27.5110 2018-06-30 22:22:01.1740 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.93 4095 Subscriber 1994.0 Male Yes
623393 850 2018-06-30 22:01:25.5850 2018-06-30 22:15:36.1510 NaN NaN 37.41 -121.94 NaN NaN 37.41 -121.95 4122 Customer 1990.0 Male No
623429 562 2018-06-30 21:52:05.1180 2018-06-30 22:01:27.4400 NaN NaN 37.41 -121.94 NaN NaN 37.42 -121.94 4184 Subscriber 1989.0 Male No
623456 819 2018-06-30 21:37:34.5060 2018-06-30 21:51:13.7290 NaN NaN 37.41 -121.94 NaN NaN 37.41 -121.96 4137 Customer 1985.0 Male No
623472 1176 2018-06-30 21:23:30.4880 2018-06-30 21:43:07.3310 NaN NaN 37.41 -121.94 NaN NaN 37.41 -121.94 4152 Customer 1961.0 Male No
623479 160 2018-06-30 21:37:50.9260 2018-06-30 21:40:30.9300 NaN NaN 37.41 -121.94 NaN NaN 37.42 -121.94 4105 Subscriber 1990.0 Female No
623548 7242 2018-06-30 19:19:19.5600 2018-06-30 21:20:02.1110 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.94 4082 Customer 1990.0 Male No
623549 725 2018-06-30 21:06:44.3580 2018-06-30 21:18:50.2180 NaN NaN 37.41 -121.96 NaN NaN 37.41 -121.96 4156 Customer 1983.0 Male No
623550 694 2018-06-30 21:05:49.6180 2018-06-30 21:17:24.2670 NaN NaN 37.40 -121.93 NaN NaN 37.41 -121.94 4136 Customer 1987.0 Female No
In [315]:
df[df.start_station_id.isnull()].tail(10)
Out[315]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
2211273 984 2019-02-06 08:39:02.6480 2019-02-06 08:55:26.6830 NaN NaN 37.41 -121.94 NaN NaN 37.40 -121.92 4249 Subscriber 1976.0 Male No
2217203 1159 2019-02-05 14:53:30.4310 2019-02-05 15:12:49.4460 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.94 4174 Customer 1986.0 Male No
2217647 823 2019-02-05 13:28:05.5940 2019-02-05 13:41:49.3090 NaN NaN 37.41 -121.94 NaN NaN 37.41 -121.96 4092 Subscriber 1999.0 Male No
2228284 3791 2019-02-03 14:53:55.9470 2019-02-03 15:57:07.0860 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.94 4174 Subscriber 1977.0 Male No
2230610 8209 2019-02-02 14:26:59.9650 2019-02-02 16:43:49.5180 NaN NaN 37.40 -121.93 NaN NaN 37.41 -121.94 4249 Subscriber 1993.0 Male Yes
2231957 1447 2019-02-02 12:03:04.5440 2019-02-02 12:27:12.2670 NaN NaN 37.40 -121.93 NaN NaN 37.40 -121.93 4249 Customer 1984.0 Male No
2235533 309 2019-02-01 12:59:45.9690 2019-02-01 13:04:55.4260 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.93 4249 Customer 1987.0 Female No
2235773 659 2019-02-01 12:17:37.6750 2019-02-01 12:28:37.0140 NaN NaN 37.41 -121.96 NaN NaN 37.41 -121.94 4092 Subscriber 1999.0 Female No
2235909 2013 2019-02-01 11:33:55.1470 2019-02-01 12:07:28.9400 NaN NaN 37.40 -121.94 NaN NaN 37.40 -121.94 4251 Customer 1990.0 Female No
2237004 312 2019-02-01 09:26:34.8030 2019-02-01 09:31:46.9210 NaN NaN 37.40 -121.93 NaN NaN 37.40 -121.93 4208 Subscriber 1987.0 Male No

All the rows with missing data have similar coordinates. These coordinates are also less precise (fewer decimal places) than those of the entries without missing data.
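The precision difference can be made measurable by counting decimal digits; a sketch on hypothetical sample values (note that Python's float representation drops trailing zeros, so 37.40 shows as "37.4"):

```python
import pandas as pd

# Hypothetical sample: precise station fixes vs. rounded readings
lat = pd.Series([37.761420, 37.795392, 37.41, 37.40])

# Count decimal digits via the string representation
decimals = lat.astype(str).str.split(".").str[1].str.len()

# Rows with at most two decimals are the imprecise ones
rounded = decimals <= 2
```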

In [316]:
df[df.start_station_id.isnull()].start_station_latitude.value_counts()
Out[316]:
37.41    5017
37.40    4575
37.42    1233
37.39     843
37.38     412
37.34      37
37.43      23
37.37      17
37.35      17
37.33      17
37.36       8
45.51       4
40.66       3
37.44       3
37.32       2
44.95       2
45.50       2
37.45       1
Name: start_station_latitude, dtype: int64
In [317]:
df[df.start_station_id.isnull()].start_station_longitude.value_counts()
Out[317]:
-121.94    4189
-121.93    2915
-121.96    2024
-121.95    1607
-121.92    1278
-121.97      52
-121.91      41
-121.98      33
-121.90      25
-121.89      21
-121.88       8
-73.57        6
-121.87       4
-121.84       3
-74.01        3
-121.86       3
-93.22        2
-121.83       1
-121.99       1
Name: start_station_longitude, dtype: int64

The most frequent coordinates are at 37.41 latitude and -121.94 longitude.

In [322]:
figure, axes = plt.subplots(figsize = (10,5))
sns.scatterplot(data = df[df.start_station_id.isnull()], x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
sns.scatterplot(data = df.dropna(subset=["start_station_id"]).sample(50000), x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
plt.xlim(-122,-121.8)
plt.ylim(37.3,37.45)
plt.tight_layout()

We can see that these NaN stations mostly lie outside the range of the other stations.

In [318]:
#all ids are a start- and ending point
start_station_id_list = list(df.start_station_id.drop_duplicates().dropna().astype("int"))
start_station_id_list.sort()

end_station_id_list = list(df.end_station_id.drop_duplicates().dropna().astype("int"))
end_station_id_list.sort()

start_station_id_list == end_station_id_list
Out[318]:
True
In [319]:
df_station_names = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].copy()

df_station_names.rename(columns={"end_station_id": "id", 
                                 "end_station_name": "station_name", 
                                 "end_station_latitude": "station_latitude", 
                                 "end_station_longitude": "station_longitude"}, inplace = True)

df_station_names.drop_duplicates(inplace = True)
df_station_names.dropna(inplace = True)
df_station_names.sort_values("id", inplace = True)
df_station_names.reset_index(inplace = True, drop = True)
In [323]:
figure, axes = plt.subplots(figsize = (10,5))
#sns.scatterplot(data = df[df.start_station_id.isnull()], x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
sns.scatterplot(data = df_station_names, x = "station_longitude", y = "station_latitude", s = 200)

plt.tight_layout()

First of all, we can see three clear clusters in this dataset. The FordGoBike website distinguishes the following zones: "San Francisco", "East Bay", and "San José".
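Since KMeans is already imported above, the three zones could also be recovered automatically; a minimal sketch on synthetic coordinates (the blob centers are rough assumptions for the three areas, not real station data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical station coordinates: one tight blob per service zone
rng = np.random.default_rng(0)
sf = rng.normal([37.78, -122.41], 0.01, size=(20, 2))
east_bay = rng.normal([37.85, -122.27], 0.01, size=(20, 2))
san_jose = rng.normal([37.33, -121.89], 0.01, size=(20, 2))
coords = np.vstack([sf, east_bay, san_jose])

# Cluster into the three zones named on the website
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)
labels = km.labels_
```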

In [320]:
df_station_names.id.value_counts()[df_station_names.id.value_counts()>1]
Out[320]:
192.0    3
205.0    3
130.0    3
221.0    3
244.0    3
212.0    3
208.0    3
233.0    3
101.0    3
280.0    3
250.0    2
364.0    2
234.0    2
358.0    2
224.0    2
281.0    2
344.0    2
245.0    2
302.0    2
321.0    2
173.0    2
272.0    2
345.0    2
80.0     2
Name: id, dtype: int64

Here we see something interesting: one ID can refer to multiple different stations.

In [325]:
df_station_names[df_station_names.duplicated("id", keep = False)]
Out[325]:
id station_name station_latitude station_longitude
70 80.0 Townsend St at 5th St 37.775306 -122.397380
71 80.0 Townsend St at 5th St 37.775235 -122.397437
89 101.0 San Bruno Ave at 16th St 37.766008 -122.405677
90 101.0 Potrero Ave at 15th St (Temporary Location) 37.766629 -122.407706
91 101.0 15th St at Potrero Ave 37.767079 -122.407359
116 130.0 22nd St Caltrain Station 37.757718 -122.391813
117 130.0 22nd St Caltrain Station 37.757288 -122.392051
118 130.0 22nd St Caltrain Station 37.757369 -122.392057
157 173.0 Shattuck Ave at 55th Ave 37.840364 -122.264488
158 173.0 Shattuck Ave at 55th St 37.840364 -122.264488
175 192.0 MLK Jr Way at 36th St (Temporary Location) 37.825794 -122.269399
176 192.0 37th St at West St 37.826696 -122.271795
177 192.0 37th St at West St 37.826696 -122.271797
189 205.0 Shafter Ave at Cavour St 37.837947 -122.257243
190 205.0 Miles Ave at Cavour St 37.838800 -122.258732
191 205.0 Miles Ave at Cavour St (Temporary Location) 37.838800 -122.258732
194 208.0 S. 4th St at San Carlos St 37.330040 -121.881821
195 208.0 William St at 4th St (Temporary Location) 37.329964 -121.881922
196 208.0 S. 4th St at San Carlos St 37.332836 -121.883912
200 212.0 Mosswood Park 37.824893 -122.260437
201 212.0 Mosswood Park 37.824931 -122.260479
202 212.0 Webster St at MacArthur Blvd (Temporary Location) 37.825008 -122.261649
211 221.0 6th Ave at E 12th St (Temporary Location) 37.794396 -122.253842
212 221.0 E 12th St at 6th Ave 37.794353 -122.253874
213 221.0 12th St at 6th Ave 37.794353 -122.253874
216 224.0 21st Ave at International Blvd 37.784855 -122.239305
217 224.0 21st Ave at International Blvd 37.785157 -122.238915
226 233.0 4th Ave at E 12th St (Temporary Location) 37.795913 -122.255547
227 233.0 12th St at 4th Ave 37.795812 -122.255555
228 233.0 E 12th St at 4th Ave 37.795812 -122.255555
229 234.0 Fruitvale Ave at International Blvd 37.777682 -122.225771
230 234.0 Farnam St at Fruitvale Ave 37.778058 -122.225400
240 244.0 Shattuck Ave at Hearst Ave 37.873748 -122.268648
241 244.0 Shattuck Ave at Hearst Ave 37.873792 -122.268618
242 244.0 Shattuck Ave at Hearst Ave 37.873676 -122.268487
243 245.0 Downtown Berkeley BART 37.870139 -122.268422
244 245.0 Downtown Berkeley BART 37.870348 -122.267764
249 250.0 North Berkeley BART Station 37.873558 -122.283093
250 250.0 North Berkeley BART Station 37.874014 -122.283019
269 272.0 Woolsey St at Sacramento St1 37.850578 -122.278175
270 272.0 Woolsey St at Sacramento St 37.850578 -122.278175
278 280.0 6th St at San Fernando St (Temporary Location) 37.337038 -121.884125
279 280.0 San Fernando St at 7th St 37.337122 -121.883215
280 280.0 San Fernando at 7th St 37.337246 -121.883074
281 281.0 9th St at San Fernando St 37.338395 -121.880797
282 281.0 9th St at San Fernando 37.338395 -121.880797
303 302.0 Tamien Station 37.312854 -121.882941
304 302.0 Tamien Station 37.347721 -121.890856
321 321.0 5th at Folsom 37.780146 -122.403071
322 321.0 5th St at Folsom 37.780146 -122.403071
334 344.0 16th St Depot 37.766349 -122.396289
335 344.0 16th St Depot 37.766349 -122.396292
336 345.0 Hubbell St at 16th St 37.766483 -122.398279
337 345.0 Hubbell St at 16th St 37.766474 -122.398295
345 358.0 Williams Ave at 3rd St 37.729279 -122.392896
346 358.0 Lane St at Van Dyke Ave 37.729252 -122.392378
352 364.0 China Basin St at 3rd St 37.772000 -122.389970
353 364.0 Mission Rock St at 3rd St 37.772886 -122.389940

It is possible that some of these stations changed position over time and/or were renamed. This is a data consistency problem, because the IDs are not unique.
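One way to resolve this during cleaning would be to keep a single canonical row per ID; a sketch on a toy station table (keep="last" assumes later rows are the more recent ones, which would need to be verified against the timestamps):

```python
import pandas as pd

# Hypothetical station table with one ID mapped to two locations over time
df_station_names = pd.DataFrame({
    "id": [80.0, 80.0, 101.0],
    "station_name": ["Townsend St at 5th St", "Townsend St at 5th St",
                     "San Bruno Ave at 16th St"],
    "station_latitude": [37.775306, 37.775235, 37.766008],
    "station_longitude": [-122.397380, -122.397437, -122.405677],
})

# One canonical row per ID; assumed: later rows are more recent
canonical = df_station_names.drop_duplicates("id", keep="last").reset_index(drop=True)
```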

In [326]:
df_station_names.drop_duplicates(subset = ["id", "station_latitude", "station_longitude"]).id.value_counts().head(25) #excludes 4 cases
Out[326]:
192.0    3
280.0    3
130.0    3
244.0    3
212.0    3
208.0    3
101.0    3
302.0    2
345.0    2
234.0    2
358.0    2
224.0    2
221.0    2
344.0    2
245.0    2
250.0    2
364.0    2
205.0    2
233.0    2
80.0     2
116.0    1
112.0    1
106.0    1
136.0    1
134.0    1
Name: id, dtype: int64

If we filter out the stations that kept their position and only got a new name, we can exclude 4 cases. The remaining cases are still worth investigating.

In [327]:
df_consistency = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].copy().merge(df_station_names, left_on = "end_station_id", right_on = "id", how = "outer")

df_consistency_issues = df_consistency.drop_duplicates().dropna()
In [328]:
df_stat_dist = df_consistency_issues[(df_consistency_issues.end_station_latitude != df_consistency_issues.station_latitude) & \
                                     (df_consistency_issues.end_station_longitude != df_consistency_issues.station_longitude)].copy()
In [329]:
#https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude/43211266#43211266

def calculate_coord_dist(row):
    coords_1 = (row["end_station_latitude"], row["end_station_longitude"])
    coords_2 = (row["station_latitude"], row["station_longitude"])
    
    return geopy.distance.distance(coords_1, coords_2).m
In [330]:
df_stat_dist["dist_in_m"] = df_stat_dist[["end_station_latitude", "end_station_longitude", "station_latitude", "station_longitude"]].apply(calculate_coord_dist, axis = 1)
In [331]:
df_stat_dist_data = df_stat_dist.drop_duplicates("dist_in_m").sort_values("dist_in_m", ascending = False)
df_stat_dist_data
Out[331]:
end_station_id end_station_name end_station_latitude end_station_longitude id station_name station_latitude station_longitude dist_in_m
2225146 302.0 Tamien Station 37.312854 -121.882941 302.0 Tamien Station 37.347721 -121.890856 3932.749001
2089615 208.0 S. 4th St at San Carlos St 37.332836 -121.883912 208.0 William St at 4th St (Temporary Location) 37.329964 -121.881922 364.265416
2089614 208.0 S. 4th St at San Carlos St 37.332836 -121.883912 208.0 S. 4th St at San Carlos St 37.330040 -121.881821 361.415563
2064741 192.0 MLK Jr Way at 36th St (Temporary Location) 37.825794 -122.269399 192.0 37th St at West St 37.826696 -122.271797 233.626394
2062975 192.0 37th St at West St 37.826696 -122.271795 192.0 MLK Jr Way at 36th St (Temporary Location) 37.825794 -122.269399 233.531415
2114531 101.0 San Bruno Ave at 16th St 37.766008 -122.405677 101.0 Potrero Ave at 15th St (Temporary Location) 37.766629 -122.407706 191.626231
2114532 101.0 San Bruno Ave at 16th St 37.766008 -122.405677 101.0 15th St at Potrero Ave 37.767079 -122.407359 189.950934
742450 205.0 Shafter Ave at Cavour St 37.837947 -122.257243 205.0 Miles Ave at Cavour St 37.838800 -122.258732 161.715577
1519216 212.0 Webster St at MacArthur Blvd (Temporary Location) 37.825008 -122.261649 212.0 Mosswood Park 37.824893 -122.260437 107.512094
1510980 212.0 Mosswood Park 37.824931 -122.260479 212.0 Webster St at MacArthur Blvd (Temporary Location) 37.825008 -122.261649 103.387088
2343287 364.0 Mission Rock St at 3rd St 37.772886 -122.389940 364.0 China Basin St at 3rd St 37.772000 -122.389970 98.394696
1905819 280.0 San Fernando at 7th St 37.337246 -121.883074 280.0 6th St at San Fernando St (Temporary Location) 37.337038 -121.884125 95.963430
1915003 280.0 6th St at San Fernando St (Temporary Location) 37.337038 -121.884125 280.0 San Fernando St at 7th St 37.337122 -121.883215 81.208846
1872718 245.0 Downtown Berkeley BART 37.870348 -122.267764 245.0 Downtown Berkeley BART 37.870139 -122.268422 62.382139
2121831 101.0 Potrero Ave at 15th St (Temporary Location) 37.766629 -122.407706 101.0 15th St at Potrero Ave 37.767079 -122.407359 58.533405
1950539 234.0 Fruitvale Ave at International Blvd 37.777682 -122.225771 234.0 Farnam St at Fruitvale Ave 37.778058 -122.225400 53.025853
1072319 130.0 22nd St Caltrain Station 37.757718 -122.391813 130.0 22nd St Caltrain Station 37.757288 -122.392051 52.158184
800667 250.0 North Berkeley BART Station 37.874014 -122.283019 250.0 North Berkeley BART Station 37.873558 -122.283093 51.041354
2167783 224.0 21st Ave at International Blvd 37.785157 -122.238915 224.0 21st Ave at International Blvd 37.784855 -122.239305 47.972946
2357010 358.0 Lane St at Van Dyke Ave 37.729252 -122.392378 358.0 Williams Ave at 3rd St 37.729279 -122.392896 45.729367
1054927 130.0 22nd St Caltrain Station 37.757369 -122.392057 130.0 22nd St Caltrain Station 37.757718 -122.391813 44.348904
1905820 280.0 San Fernando at 7th St 37.337246 -121.883074 280.0 San Fernando St at 7th St 37.337122 -121.883215 18.546296
233431 244.0 Shattuck Ave at Hearst Ave 37.873792 -122.268618 244.0 Shattuck Ave at Hearst Ave 37.873676 -122.268487 17.201880
237913 244.0 Shattuck Ave at Hearst Ave 37.873748 -122.268648 244.0 Shattuck Ave at Hearst Ave 37.873676 -122.268487 16.258134
2093155 208.0 S. 4th St at San Carlos St 37.330040 -121.881821 208.0 William St at 4th St (Temporary Location) 37.329964 -121.881922 12.299144
570081 233.0 12th St at 4th Ave 37.795812 -122.255555 233.0 4th Ave at E 12th St (Temporary Location) 37.795913 -122.255547 11.176438
483846 80.0 Townsend St at 5th St 37.775306 -122.397380 80.0 Townsend St at 5th St 37.775235 -122.397437 9.342568
1054928 130.0 22nd St Caltrain Station 37.757369 -122.392057 130.0 22nd St Caltrain Station 37.757288 -122.392051 8.956456
1510978 212.0 Mosswood Park 37.824931 -122.260479 212.0 Mosswood Park 37.824893 -122.260437 5.665735
2159233 221.0 12th St at 6th Ave 37.794353 -122.253874 221.0 6th Ave at E 12th St (Temporary Location) 37.794396 -122.253842 5.546263
233429 244.0 Shattuck Ave at Hearst Ave 37.873792 -122.268618 244.0 Shattuck Ave at Hearst Ave 37.873748 -122.268648 5.525802
2247298 345.0 Hubbell St at 16th St 37.766474 -122.398295 345.0 Hubbell St at 16th St 37.766483 -122.398279 1.702005
2261010 344.0 16th St Depot 37.766349 -122.396289 344.0 16th St Depot 37.766349 -122.396292 0.250294
2062977 192.0 37th St at West St 37.826696 -122.271795 192.0 37th St at West St 37.826696 -122.271797 0.153375

We can see that the distances between relocated stations range from 0 to 364 meters (and nearly 4 km in the worst case). This needs to be addressed during data cleaning. Now we will look at the overall numeric structure of this dataset.
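During cleaning, small shifts could be treated as the same physical station while large jumps (like the ~4 km Tamien case) are flagged for manual review; a sketch with an assumed 500 m cutoff on invented distances:

```python
import pandas as pd

# Hypothetical per-ID displacement table (distances in meters)
df_stat_dist_data = pd.DataFrame({
    "id": [302.0, 208.0, 344.0],
    "dist_in_m": [3932.75, 364.27, 0.25],
})

# Assumed cutoff, not from the source: small shifts = same station
THRESHOLD_M = 500
df_stat_dist_data["same_station"] = df_stat_dist_data["dist_in_m"] < THRESHOLD_M
```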

In [260]:
df.describe().round(2)
Out[260]:
duration_sec start_station_latitude start_station_longitude end_station_latitude end_station_longitude label age dayofweek start_hr end_hr
count 2226999.00 2226999.00 2226999.00 2226999.00 2226999.00 2226999.00 2226999.00 2226999.00 2226999.00 2226999.00
mean 833.69 37.77 -122.35 37.77 -122.35 1.17 149.02 2.60 13.50 13.67
std 2249.12 0.10 0.12 0.10 0.12 0.48 462.35 1.83 4.74 4.77
min 61.00 37.26 -122.47 37.26 -122.47 0.00 17.00 0.00 0.00 0.00
25% 346.00 37.77 -122.41 37.77 -122.41 1.00 28.00 1.00 9.00 9.00
50% 549.00 37.78 -122.40 37.78 -122.40 1.00 33.00 3.00 14.00 14.00
75% 859.00 37.80 -122.29 37.80 -122.29 1.00 42.00 4.00 17.00 18.00
max 86366.00 37.88 -121.83 37.88 -121.83 2.00 2018.00 6.00 23.00 23.00

Nothing too obvious here, although the maximum duration_sec looks like an outlier, and a maximum age of 2018 is clearly invalid.

In [261]:
df.query("duration_sec == 86366")
Out[261]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... new_id label age month_year day_month_year dayofweek start_hr end_hr label_name combi
1375987 86366 2018-02-03 17:32:04.042 2018-02-04 17:31:30.788 197 El Embarcadero at Grand Ave 37.808848 -122.24968 197 El Embarcadero at Grand Ave 37.808848 ... 197_0 2 2018 2018-02 2018-02-03 5 17 17 San Francisco 197_0 - 197_0

1 rows × 28 columns

Maybe somebody forgot to dock the bike again.
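Such implausibly long rentals could simply be flagged with a duration cutoff; a sketch on toy values (the 80,000-second cutoff is an assumption, roughly a full day minus a margin):

```python
import pandas as pd

# Hypothetical durations in seconds; one trip runs almost a full day
df = pd.DataFrame({"duration_sec": [549, 859, 86366]})

# Assumed cutoff: trips longer than ~22 hours are likely
# "forgot to dock" cases rather than genuine rides
CUTOFF_SEC = 80_000
outliers = df[df.duration_sec > CUTOFF_SEC]
```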

In [262]:
df.query("duration_sec > 80000")
Out[262]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... new_id label age month_year day_month_year dayofweek start_hr end_hr label_name combi
5131 81785 2018-07-27 17:51:06.310 2018-07-28 16:34:11.905 86 Market St at Dolores St 37.769305 -122.426826 43 San Francisco Public Library (Grove St at Hyde... 37.778768 ... 86_0 1 34 2018-07 2018-07-27 4 17 16 San José 86_0 - 43_0
11320 86003 2018-04-02 16:43:50.239 2018-04-03 16:37:13.786 316 San Salvador St at 1st St 37.330165 -121.885831 304 Jackson St at 5th St 37.348759 ... 316_0 0 2018 2018-04 2018-04-02 0 16 16 East Bay 316_0 - 304_0
43638 82743 2019-01-19 16:13:13.763 2019-01-20 15:12:16.894 11 Davis St at Jackson St 37.797280 -122.398436 64 5th St at Brannan St 37.776754 ... 11_0 1 42 2019-01 2019-01-19 5 16 15 San José 11_0 - 64_0
61586 82849 2018-06-26 18:24:37.670 2018-06-27 17:25:27.530 17 Embarcadero BART Station (Beale St at Market St) 37.792251 -122.397086 345 Hubbell St at 16th St 37.766474 ... 17_0 1 25 2018-06 2018-06-26 1 18 17 San José 17_0 - 345_0
97230 84193 2018-01-10 10:01:32.352 2018-01-11 09:24:46.268 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 72 Page St at Scott St 37.772406 ... 15_0 1 49 2018-01 2018-01-10 2 10 9 San José 15_0 - 72_0
98831 85422 2018-01-31 16:13:34.351 2018-02-01 15:57:17.310 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 ... 15_0 1 2018 2018-01 2018-01-31 2 16 15 San José 15_0 - 15_0
100414 80343 2018-02-16 17:02:41.515 2018-02-17 15:21:45.351 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 342 Colin P Kelly Jr St at Townsend St (Temporary ... 37.781383 ... 15_0 1 2018 2018-02 2018-02-16 4 17 15 San José 15_0 - 342_0
104907 83754 2018-04-13 12:37:04.119 2018-04-14 11:52:58.211 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 93 4th St at Mission Bay Blvd S 37.770407 ... 15_0 1 69 2018-04 2018-04-13 4 12 11 San José 15_0 - 93_0
111713 81678 2018-06-10 18:39:19.656 2018-06-11 17:20:38.104 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 104 4th St at 16th St 37.767045 ... 15_0 1 2018 2018-06 2018-06-10 6 18 17 San José 15_0 - 104_0
112784 82524 2018-06-18 20:54:51.141 2018-06-19 19:50:15.385 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 126 Esprit Park 37.761634 ... 15_0 1 28 2018-06 2018-06-18 0 20 19 San José 15_0 - 126_0
118522 82650 2018-07-29 10:52:09.279 2018-07-30 09:49:39.522 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 5 Powell St BART Station (Market St at 5th St) 37.783899 ... 15_0 1 38 2018-07 2018-07-29 6 10 9 San José 15_0 - 5_0
124332 82884 2018-09-14 09:50:41.149 2018-09-15 08:52:05.414 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 349 Howard St at Mary St 37.781010 ... 15_0 1 2018 2018-09 2018-09-14 4 9 8 San José 15_0 - 349_0
125357 85287 2018-09-21 19:53:31.030 2018-09-22 19:34:58.455 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 53 Grove St at Divisadero 37.775946 ... 15_0 1 30 2018-09 2018-09-21 4 19 19 San José 15_0 - 53_0
126348 80015 2018-09-29 17:48:32.752 2018-09-30 16:02:08.732 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 121 Mission Playground 37.759210 ... 15_0 1 2018 2018-09 2018-09-29 5 17 16 San José 15_0 - 121_0
128193 84270 2018-10-13 11:52:32.859 2018-10-14 11:17:03.820 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 6 The Embarcadero at Sansome St 37.804770 ... 15_0 1 2018 2018-10 2018-10-13 5 11 11 San José 15_0 - 6_0
137374 82362 2019-01-25 08:30:41.624 2019-01-26 07:23:23.662 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 30 San Francisco Caltrain (Townsend St at 4th St) 37.776598 ... 15_0 1 51 2019-01 2019-01-25 4 8 7 San José 15_0 - 30_0
148207 84320 2018-05-04 09:31:57.039 2018-05-05 08:57:17.782 16 Steuart St at Market St 37.794130 -122.394430 50 2nd St at Townsend St 37.780526 ... 16_0 1 40 2018-05 2018-05-04 4 9 8 San José 16_0 - 50_0
166546 84670 2018-11-13 15:13:13.220 2018-11-14 14:44:23.826 16 Steuart St at Market St 37.794130 -122.394430 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 ... 16_0 1 34 2018-11 2018-11-13 1 15 14 San José 16_0 - 21_0
176357 80681 2018-04-07 17:24:15.470 2018-04-08 15:48:56.587 8 The Embarcadero at Vallejo St 37.799953 -122.398525 6 The Embarcadero at Sansome St 37.804770 ... 8_0 1 2018 2018-04 2018-04-07 5 17 15 San José 8_0 - 6_0
182010 81175 2018-09-02 15:03:34.659 2018-09-03 13:36:30.379 8 The Embarcadero at Vallejo St 37.799953 -122.398525 44 Civic Center/UN Plaza BART Station (Market St ... 37.781074 ... 8_0 1 2018 2018-09 2018-09-02 6 15 13 San José 8_0 - 44_0
187640 80358 2018-05-26 19:07:02.853 2018-05-27 17:26:21.776 304 Jackson St at 5th St 37.348759 -121.894798 304 Jackson St at 5th St 37.348759 ... 304_0 0 2018 2018-05 2018-05-26 5 19 17 East Bay 304_0 - 304_0
204369 84940 2018-08-11 21:26:41.875 2018-08-12 21:02:22.341 22 Howard St at Beale St 37.789756 -122.394643 134 Valencia St at 24th St 37.752428 ... 22_0 1 18 2018-08 2018-08-11 5 21 21 San José 22_0 - 134_0
222372 84833 2018-03-30 17:00:47.966 2018-03-31 16:34:41.581 25 Howard St at 2nd St 37.787522 -122.397405 37 2nd St at Folsom St 37.785000 ... 25_0 1 2018 2018-03 2018-03-30 4 17 16 San José 25_0 - 37_0
253509 80533 2018-01-13 14:46:01.962 2018-01-14 13:08:15.778 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 56 Koshland Park 37.773414 ... 3_0 1 2018 2018-01 2018-01-13 5 14 13 San José 3_0 - 56_0
262094 81387 2018-05-16 14:11:53.475 2018-05-17 12:48:20.808 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 80 Townsend St at 5th St 37.775306 ... 3_0 1 34 2018-05 2018-05-16 2 14 12 San José 3_0 - 80_0
279961 85983 2018-11-01 16:56:02.016 2018-11-02 16:49:05.265 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 56 Koshland Park 37.773414 ... 3_0 1 26 2018-11 2018-11-01 3 16 16 San José 3_0 - 56_0
289076 84548 2019-02-16 15:48:25.029 2019-02-17 15:17:33.080 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 368 Myrtle St at Polk St 37.785434 ... 3_0 1 37 2019-02 2019-02-16 5 15 15 San José 3_0 - 368_0
344652 83579 2018-07-27 16:47:25.305 2018-07-28 16:00:24.406 20 Mechanics Monument Plaza (Market St at Bush St) 37.791300 -122.399051 3 Powell St BART Station (Market St at 4th St) 37.786375 ... 20_0 1 2018 2018-07 2018-07-27 4 16 16 San José 20_0 - 3_0
345532 82966 2018-08-16 15:23:47.289 2018-08-17 14:26:34.050 20 Mechanics Monument Plaza (Market St at Bush St) 37.791300 -122.399051 323 Broadway at Kearny 37.798014 ... 20_0 1 2018 2018-08 2018-08-16 3 15 14 San José 20_0 - 323_0
378047 85141 2018-09-28 12:13:03.948 2018-09-29 11:52:05.564 58 Market St at 10th St 37.776619 -122.417385 343 Bryant St at 2nd St 37.783172 ... 58_0 1 34 2018-09 2018-09-28 4 12 11 San José 58_0 - 343_0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1824980 82210 2018-07-21 09:16:20.692 2018-07-22 08:06:31.276 238 MLK Jr Way at University Ave 37.871719 -122.273068 244 Shattuck Ave at Hearst Ave 37.873792 ... 238_0 2 2018 2018-07 2018-07-21 5 9 8 San Francisco 238_0 - 244_0
1845919 85978 2018-02-10 17:19:21.884 2018-02-11 17:12:19.979 124 19th St at Florida St 37.760447 -122.410807 122 19th St at Mission St 37.760299 ... 124_0 1 2018 2018-02 2018-02-10 5 17 17 San José 124_0 - 122_0
1861224 83140 2018-02-07 12:18:26.407 2018-02-08 11:24:06.830 270 Ninth St at Heinz Ave 37.853489 -122.289415 251 California St at University Ave 37.870555 ... 270_0 2 2018 2018-02 2018-02-07 2 12 11 San Francisco 270_0 - 251_0
1869387 80266 2018-04-30 08:55:57.990 2018-05-01 07:13:44.795 339 Jackson St at 11th St 37.800002 -122.266438 202 Washington St at 8th St 37.800754 ... 339_0 2 34 2018-04 2018-04-30 0 8 7 San Francisco 339_0 - 202_0
1873478 85967 2018-08-20 12:12:56.340 2018-08-21 12:05:44.330 151 53rd St at Hollis St 37.836182 -122.287180 259 Addison St at Fourth St 37.866249 ... 151_0 2 28 2018-08 2018-08-20 0 12 12 San Francisco 151_0 - 259_0
1885072 83788 2018-02-13 09:00:21.081 2018-02-14 08:16:49.420 108 16th St Mission BART 37.764710 -122.419957 114 Rhode Island St at 17th St 37.764478 ... 108_0 1 37 2018-02 2018-02-13 1 9 8 San José 108_0 - 114_0
1887707 83382 2018-07-26 17:25:40.427 2018-07-27 16:35:23.152 108 16th St Mission BART 37.764710 -122.419957 129 Harrison St at 20th St 37.758862 ... 108_0 1 45 2018-07 2018-07-26 3 17 16 San José 108_0 - 129_0
1896368 81087 2018-06-15 14:10:16.323 2018-06-16 12:41:44.202 337 Webster St at 19th St 37.806970 -122.266588 197 El Embarcadero at Grand Ave 37.808848 ... 337_0 2 2018 2018-06 2018-06-15 4 14 12 San Francisco 337_0 - 197_0
1900474 84468 2018-04-22 15:56:35.641 2018-04-23 15:24:23.823 160 West Oakland BART Station 37.805318 -122.294837 160 West Oakland BART Station 37.805318 ... 160_0 2 2018 2018-04 2018-04-22 6 15 15 San Francisco 160_0 - 160_0
1900475 84490 2018-04-22 15:56:43.289 2018-04-23 15:24:53.408 160 West Oakland BART Station 37.805318 -122.294837 160 West Oakland BART Station 37.805318 ... 160_0 2 2018 2018-04 2018-04-22 6 15 15 San Francisco 160_0 - 160_0
1915880 86205 2018-10-30 17:24:11.110 2018-10-31 17:20:56.798 276 Julian St at The Alameda 37.332233 -121.912517 277 Morrison Ave at Julian St 37.333658 ... 276_0 0 23 2018-10 2018-10-30 1 17 17 East Bay 276_0 - 277_0
1921487 82708 2018-11-24 15:29:33.860 2018-11-25 14:28:02.077 84 Duboce Park 37.769201 -122.433812 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 ... 84_0 1 39 2018-11 2018-11-24 5 15 14 San José 84_0 - 21_0
1923811 81657 2018-06-08 10:00:06.684 2018-06-09 08:41:03.747 274 Oregon St at Adeline St 37.857567 -122.267558 241 Ashby BART Station 37.852477 ... 274_0 2 52 2018-06 2018-06-08 4 10 8 San Francisco 274_0 - 241_0
1936901 85175 2018-12-25 10:09:32.467 2018-12-26 09:49:07.622 188 Dover St at 57th St 37.842630 -122.267738 188 Dover St at 57th St 37.842630 ... 188_0 2 27 2018-12 2018-12-25 1 10 9 San Francisco 188_0 - 188_0
1938898 81712 2018-01-14 13:13:21.061 2018-01-15 11:55:13.325 191 Market St at 40th St 37.830545 -122.273937 191 Market St at 40th St 37.830545 ... 191_0 2 2018 2018-01 2018-01-14 6 13 11 San Francisco 191_0 - 191_0
1951310 83380 2018-07-31 09:22:29.100 2018-08-01 08:32:09.270 213 32nd St at Adeline St 37.823847 -122.281193 198 Snow Park 37.807813 ... 213_0 2 36 2018-07 2018-07-31 1 9 8 San Francisco 213_0 - 198_0
1952921 81829 2018-05-03 09:04:11.042 2018-05-04 07:48:00.917 251 California St at University Ave 37.870555 -122.279720 258 University Ave at Oxford St 37.872355 ... 251_0 2 30 2018-05 2018-05-03 3 9 7 San Francisco 251_0 - 258_0
1975471 85475 2018-08-08 21:53:59.379 2018-08-09 21:38:35.152 237 Fruitvale BART Station 37.775232 -122.224498 227 Foothill Blvd at Fruitvale Ave 37.783757 ... 237_0 2 24 2018-08 2018-08-08 2 21 21 San Francisco 237_0 - 227_0
1976419 81413 2018-03-23 15:02:35.805 2018-03-24 13:39:29.711 265 Ninth St at Parker St 37.858868 -122.291209 270 Ninth St at Heinz Ave 37.853489 ... 265_0 2 2018 2018-03 2018-03-23 4 15 13 San Francisco 265_0 - 270_0
1982628 82507 2018-03-28 11:29:58.565 2018-03-29 10:25:06.542 206 College Ave at Bryant Ave 37.838127 -122.251271 172 College Ave at Taft Ave 37.841800 ... 206_0 2 2018 2018-03 2018-03-28 2 11 10 San Francisco 206_0 - 172_0
1994241 80825 2019-01-31 17:57:44.613 2019-02-01 16:24:49.864 229 Foothill Blvd at 42nd Ave 37.775745 -122.213037 196 Grand Ave at Perkins St 37.808894 ... 229_0 2 29 2019-01 2019-01-31 3 17 16 San Francisco 229_0 - 196_0
2016831 80096 2018-04-10 15:15:49.988 2018-04-11 13:30:46.255 217 27th St at MLK Jr Way 37.817015 -122.271761 220 San Pablo Ave at MLK Jr Way 37.811351 ... 217_0 2 34 2018-04 2018-04-10 1 15 13 San Francisco 217_0 - 220_0
2022013 86010 2018-02-11 10:08:57.499 2018-02-12 10:02:27.851 272 Woolsey St at Sacramento St 37.850578 -122.278175 266 Parker St at Fulton St 37.862464 ... 272_0 2 2018 2018-02 2018-02-11 6 10 10 San Francisco 272_0 - 266_0
2065793 81954 2018-07-29 13:00:25.735 2018-07-30 11:46:20.154 155 Emeryville Public Market 37.840521 -122.293528 245 Downtown Berkeley BART 37.870348 ... 155_0 2 37 2018-07 2018-07-29 6 13 11 San Francisco 155_0 - 245_0
2069637 84188 2018-05-19 12:05:34.760 2018-05-20 11:28:43.136 168 Alcatraz Ave at Shattuck Ave 37.849595 -122.265569 176 MacArthur BART Station 37.828410 ... 168_0 2 25 2018-05 2018-05-19 5 12 11 San Francisco 168_0 - 176_0
2070083 84463 2018-08-09 17:28:55.636 2018-08-10 16:56:39.175 168 Alcatraz Ave at Shattuck Ave 37.849595 -122.265569 204 55th St at Telegraph Ave 37.840186 ... 168_0 2 32 2018-08 2018-08-09 3 17 16 San Francisco 168_0 - 204_0
2143544 80943 2018-12-14 08:29:11.531 2018-12-15 06:58:14.586 87 Folsom St at 13th St 37.769757 -122.415674 110 17th & Folsom Street Park (17th St at Folsom St) 37.763708 ... 87_0 1 42 2018-12 2018-12-14 4 8 6 San José 87_0 - 110_0
2173664 83175 2018-09-05 19:09:47.297 2018-09-06 18:16:02.853 321 5th St at Folsom 37.780146 -122.403071 93 4th St at Mission Bay Blvd S 37.770407 ... 321_1 1 30 2018-09 2018-09-05 2 19 18 San José 321_1 - 93_0
2212182 82512 2019-02-14 13:56:21.728 2019-02-15 12:51:34.315 368 Myrtle St at Polk St 37.785434 -122.419622 44 Civic Center/UN Plaza BART Station (Market St ... 37.781074 ... 368_0 1 20 2019-02 2019-02-14 3 13 12 San José 368_0 - 44_0
2217467 82385 2019-02-02 13:07:43.036 2019-02-03 12:00:48.175 377 Fell St at Stanyan St 37.771917 -122.453704 377 Fell St at Stanyan St 37.771917 ... 377_0 1 2018 2019-02 2019-02-02 5 13 12 San José 377_0 - 377_0

185 rows × 28 columns

Or some users really do rent these bikes for that long. For the final part of this assessment we will look at the missing values in the member_birth_year column.
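To put these durations into perspective, a quick conversion with pandas (a minimal sketch reusing the duration_sec values seen in the table above) shows that they are just short of a full day:

```python
import pandas as pd

# duration_sec values taken from the long rentals above
durations = pd.Series([85422, 61076, 6477])

# convert to timedeltas / hours for readability
as_timedelta = pd.to_timedelta(durations, unit="s")
as_hours = durations / 3600

print(as_timedelta.iloc[0])        # 0 days 23:43:42
print(round(as_hours.iloc[0], 1))  # 23.7
```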

In [339]:
df[df.member_birth_year.isnull()].head()
Out[339]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
1 85422 2018-01-31 16:13:34.3510 2018-02-01 15:57:17.3100 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 2815 Customer NaN NaN No
3 61076 2018-01-31 14:53:23.5620 2018-02-01 07:51:20.5000 75.0 Market St at Franklin St 37.773793 -122.421239 47.0 4th St at Harrison St 37.780955 -122.399749 321 Customer NaN NaN No
5 6477 2018-01-31 22:58:44.8670 2018-02-01 00:46:41.9080 236.0 Market St at 8th St 37.803686 -122.282497 160.0 West Oakland BART Station 37.805318 -122.294837 1306 Customer NaN NaN No
10 1316 2018-01-31 23:24:00.4160 2018-01-31 23:45:56.8240 98.0 Valencia St at 16th St 37.765052 -122.421866 4.0 Cyril Magnin St at Ellis St 37.785881 -122.408915 1278 Customer NaN NaN No
18 350 2018-01-31 23:13:45.5290 2018-01-31 23:19:36.2650 247.0 Fulton St at Bancroft Way 37.867789 -122.265896 274.0 Oregon St at Adeline St 37.857567 -122.267558 431 Subscriber NaN NaN No

It is possible that users are not required to provide their birth year during registration, or that they chose not to share this information at all.

Assessing Summary

Quality

df table
  • there are missing values in the station_id, station_name, member_birth_year and member_gender columns
  • the columns 'start_time' and 'end_time' are not datetime type
  • the columns 'start_station_id', 'end_station_id', 'member_birth_year' and 'bike_id' are not object type
  • some stations share the same ID even though their position changed over time (consistency problem)

Tidiness

df table
  • None

Data Cleaning

Issue #1

Define #1

  • there are missing values in the station_id, station_name, member_birth_year and member_gender columns

Clean #1

In [279]:
#there are multiple ways to handle this - but since the remaining coordinates are not precise, we will just drop these rows
df.dropna(subset = ["start_station_id"], inplace = True)
In [280]:
#fill up the missing values
df.member_birth_year.fillna(0, inplace = True)
df.member_gender.fillna("not defined", inplace = True)

Test #1

In [281]:
ms.matrix(df)
Out[281]:
<matplotlib.axes._subplots.AxesSubplot at 0x230386d6438>

Issue #2

Define #2

  • the columns 'start_time' and 'end_time' are not datetime type
  • the columns 'start_station_id', 'end_station_id', 'member_birth_year' and 'bike_id' are not object type

Clean #2

In [282]:
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
    
for col in ["start_station_id", "end_station_id"]:
    df[col] = df[col].astype("int")

for col in ["start_station_id", "end_station_id", "member_birth_year", "bike_id"]:
    df[col] = df[col].astype("str")

Test #2

In [283]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2226999 entries, 0 to 2239214
Data columns (total 16 columns):
duration_sec               int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id           object
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             object
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    object
user_type                  object
member_birth_year          object
member_gender              object
bike_share_for_all_trip    object
dtypes: datetime64[ns](2), float64(4), int64(1), object(9)
memory usage: 288.8+ MB

Issue #3

Define #3

  • some stations share the same ID even though their position changed over time (consistency problem)

Clean #3

On the one hand we could ignore most of these cases, because the different positions of a station are relatively close to each other. But that raises the question: how close is close? Another approach is to give the IDs a new "subindex": every time the coordinates of a station ID differ from the previous occurrence of that station, we increase the subindex by 1, provided the calculated distance is greater than some threshold x (here: 100 m).
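The threshold check below relies on geopy.distance; as a stdlib-only sanity check (an approximate haversine stand-in, not geopy's geodesic calculation), the two recorded positions of station 208 turn out to be several hundred metres apart:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (stdlib stand-in for geopy.distance)."""
    R = 6_371_000  # mean earth radius in metres
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * R * asin(sqrt(a))

# the two recorded positions of station 208
d = haversine_m(37.332836, -121.883912, 37.330040, -121.881821)
print(round(d))  # ~360 m, well above the 100 m threshold
```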

In [284]:
df.sort_values("start_time", inplace = True)
df.reset_index(inplace = True, drop = True)
In [285]:
df.head()
Out[285]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 13709 2018-01-01 00:01:53.847 2018-01-01 03:50:23.433 86 Market St at Dolores St 37.769305 -122.426826 119 18th St at Noe St 37.761047 -122.432642 2924 Customer 0.0 not defined No
1 145 2018-01-01 00:07:41.040 2018-01-01 00:10:06.241 316 San Salvador St at 1st St 37.330165 -121.885831 311 Paseo De San Antonio at 2nd St 37.333798 -121.886943 2473 Subscriber 1957.0 Male No
2 714 2018-01-01 00:07:52.943 2018-01-01 00:19:47.075 74 Laguna St at Hayes St 37.776435 -122.426244 70 Central Ave at Fell St 37.773311 -122.444293 2423 Subscriber 1986.0 Male No
3 1151 2018-01-01 00:09:31.745 2018-01-01 00:28:43.159 97 14th St at Mission St 37.768265 -122.420110 125 20th St at Bryant St 37.759200 -122.409851 3455 Subscriber 1992.0 Male No
4 1359 2018-01-01 00:14:37.187 2018-01-01 00:37:16.445 11 Davis St at Jackson St 37.797280 -122.398436 323 Broadway at Kearny 37.798014 -122.405950 2173 Customer 1995.0 Male No
In [286]:
def get_new_id(col):
    #col holds (id, name, latitude, longitude) of one row
    
    #first occurrence of this station id - store it with subindex 0
    if col[0] not in id_data:
        id_data[col[0]] = [col[1], col[2], col[3], f"{col[0]}_0"]
        return id_data[col[0]][3]
    
    #the id is known: if the coordinates changed, calculate the distance;
    #for a jump of more than 100 m increase the subindex and store the new
    #coordinates, otherwise keep returning the saved id
    if id_data[col[0]][1] != col[2] or id_data[col[0]][2] != col[3]:
        coords_1 = (id_data[col[0]][1], id_data[col[0]][2])
        coords_2 = (col[2], col[3])

        if geopy.distance.distance(coords_1, coords_2).m > 100:
            #split on "_" so that subindexes >= 10 are incremented correctly
            new_ind = f"{col[0]}_{int(id_data[col[0]][3].split('_')[-1]) + 1}"
            id_data[col[0]][3] = new_ind
            id_data[col[0]][1] = col[2]
            id_data[col[0]][2] = col[3]

            return new_ind

    return id_data[col[0]][3]
In [287]:
id_data = {}
df["start_station_id_new"] = df[["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude"]].apply(get_new_id, axis = 1)
In [288]:
id_data = {}
df["end_station_id_new"] = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].apply(get_new_id, axis = 1)
In [290]:
df_station_names = df[["start_time", "start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude"]].copy()
df_station_names.sort_values("start_time", inplace = True)
df_station_names.drop("start_time", axis = 1, inplace = True)
df_station_names.rename(columns={"start_station_id": "id", 
                                 "start_station_name": "station_name", 
                                 "start_station_latitude": "station_latitude", 
                                 "start_station_longitude": "station_longitude"}, inplace = True)

df_station_names.drop_duplicates(inplace = True)
df_station_names.dropna(inplace = True)
id_data = {}
df_station_names["new_id"] = df_station_names[["id", "station_name", "station_latitude", "station_longitude"]].apply(get_new_id, axis = 1)
df_station_names.sort_values("id", inplace = True)
df_station_names.reset_index(inplace = True, drop = True)

Test #3

In [291]:
df.query("start_station_id == '208'").drop_duplicates("start_station_id_new")[["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "start_station_id_new"]]
Out[291]:
start_station_id start_station_name start_station_latitude start_station_longitude start_station_id_new
556 208 S. 4th St at San Carlos St 37.332836 -121.883912 208_0
817544 208 S. 4th St at San Carlos St 37.330040 -121.881821 208_1
In [292]:
df.query("end_station_id == '208'").drop_duplicates("end_station_id_new")[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude", "end_station_id_new"]]
Out[292]:
end_station_id end_station_name end_station_latitude end_station_longitude end_station_id_new
478 208 S. 4th St at San Carlos St 37.332836 -121.883912 208_0
817103 208 S. 4th St at San Carlos St 37.330040 -121.881821 208_1
In [293]:
#export the station names to csv
df_station_names.to_csv("df_station_names.csv", index = False)
In [294]:
#export the cleaned csv 
df.to_csv("result_clean.csv", index = False)

Exploratory Data Visualization

Now that the cleaning part is done, we can start to visualize the data.

In [2]:
df = pd.read_csv("result_clean.csv")
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2226999 entries, 0 to 2226998
Data columns (total 18 columns):
duration_sec               int64
start_time                 object
end_time                   object
start_station_id           int64
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             int64
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    int64
user_type                  object
member_birth_year          float64
member_gender              object
bike_share_for_all_trip    object
start_station_id_new       object
end_station_id_new         object
dtypes: float64(5), int64(4), object(9)
memory usage: 305.8+ MB
In [4]:
#dtypes are lost in the CSV round trip - convert again (or switch to a format like HDF5 that preserves dtypes)
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
    
for col in ["member_birth_year"]:
    df[col] = df[col].astype("int")

for col in ["start_station_id", "end_station_id", "member_birth_year", "bike_id"]:
    df[col] = df[col].astype("str")
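Re-running these conversions after every CSV round trip can also be avoided by declaring the types at read time. A minimal sketch, using a tiny inline stand-in for result_clean.csv:

```python
import io
import pandas as pd

# a two-row stand-in for result_clean.csv
csv = io.StringIO(
    "duration_sec,start_time,end_time,start_station_id,bike_id\n"
    "145,2018-01-01 00:07:41.040,2018-01-01 00:10:06.241,316,2473\n"
    "714,2018-01-01 00:07:52.943,2018-01-01 00:19:47.075,74,2423\n"
)

df_small = pd.read_csv(
    csv,
    parse_dates=["start_time", "end_time"],               # datetime64 right away
    dtype={"start_station_id": str, "bike_id": str},      # keep the ids as strings
)

print(df_small.dtypes["start_time"])  # datetime64[ns]
print(df_small.dtypes["bike_id"])     # object
```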
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2226999 entries, 0 to 2226998
Data columns (total 18 columns):
duration_sec               int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id           object
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             object
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    object
user_type                  object
member_birth_year          object
member_gender              object
bike_share_for_all_trip    object
start_station_id_new       object
end_station_id_new         object
dtypes: datetime64[ns](2), float64(4), int64(1), object(11)
memory usage: 305.8+ MB
In [6]:
df.head()
Out[6]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_station_id_new end_station_id_new
0 13709 2018-01-01 00:01:53.847 2018-01-01 03:50:23.433 86 Market St at Dolores St 37.769305 -122.426826 119 18th St at Noe St 37.761047 -122.432642 2924 Customer 0 not defined No 86_0 119_0
1 145 2018-01-01 00:07:41.040 2018-01-01 00:10:06.241 316 San Salvador St at 1st St 37.330165 -121.885831 311 Paseo De San Antonio at 2nd St 37.333798 -121.886943 2473 Subscriber 1957 Male No 316_0 311_0
2 714 2018-01-01 00:07:52.943 2018-01-01 00:19:47.075 74 Laguna St at Hayes St 37.776435 -122.426244 70 Central Ave at Fell St 37.773311 -122.444293 2423 Subscriber 1986 Male No 74_0 70_0
3 1151 2018-01-01 00:09:31.745 2018-01-01 00:28:43.159 97 14th St at Mission St 37.768265 -122.420110 125 20th St at Bryant St 37.759200 -122.409851 3455 Subscriber 1992 Male No 97_0 125_0
4 1359 2018-01-01 00:14:37.187 2018-01-01 00:37:16.445 11 Davis St at Jackson St 37.797280 -122.398436 323 Broadway at Kearny 37.798014 -122.405950 2173 Customer 1995 Male No 11_0 323_0
In [7]:
df_station_names = pd.read_csv("df_station_names.csv")
In [8]:
df_station_names.head()
Out[8]:
id station_name station_latitude station_longitude new_id
0 10 Washington St at Kearny St 37.795393 -122.404770 10_0
1 100 Bryant St at 15th St 37.767100 -122.410662 100_0
2 101 San Bruno Ave at 16th St 37.766008 -122.405677 101_0
3 101 15th St at Potrero Ave 37.767079 -122.407359 101_1
4 101 Potrero Ave at 15th St (Temporary Location) 37.766629 -122.407706 101_1

Before we start visualizing, we can extract some additional information from the data to improve the insights. We saw that the station coordinates separate into three clusters. Since these clusters are clearly separated, we can classify them efficiently with the K-Means clustering algorithm.
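The choice of three clusters can be double-checked with the elbow method: the within-cluster inertia drops sharply up to the true number of clusters and flattens afterwards. A sketch on synthetic data, with three artificial "cities" standing in for the real station coordinates:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# three well-separated synthetic "cities", mimicking the station layout
centers = np.array([[-122.40, 37.77], [-122.27, 37.81], [-121.89, 37.33]])
points = np.vstack([c + rng.normal(scale=0.01, size=(50, 2)) for c in centers])

# inertia = sum of squared distances to the nearest cluster center
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
            for k in (1, 2, 3, 4)}

# the big drops happen up to k = 3; after that the gain is marginal
print({k: round(v, 3) for k, v in inertias.items()})
```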

In [9]:
#https://datatofish.com/k-means-clustering-python/

kmeans = KMeans(n_clusters=3).fit(df_station_names[["station_longitude", "station_latitude"]])

df_station_names["label"] = kmeans.labels_

for x in set(list(df_station_names.label)):
    
    df_plot_cluster = df_station_names.query(f"label == {x}")

    plt.scatter(df_plot_cluster['station_longitude'], df_plot_cluster["station_latitude"], s=50, alpha=0.5, label = x);

plt.legend();

Now that this is done, we can visualize the data on a map. This could be done with Bokeh or Plotly, but we will use kepler.gl for the map visualizations.

Source: kepler.gl

(Map screenshots: all stations · upper two clusters · lower cluster)

Now we will map the cluster labels to the region names.

In [10]:
mapping = {0: "San Francisco", 1: "San José", 2: "East Bay"}
In [11]:
df_station_names["label_name"] = df_station_names["label"].map(mapping)
In [12]:
df_station_names.head()
Out[12]:
id station_name station_latitude station_longitude new_id label label_name
0 10 Washington St at Kearny St 37.795393 -122.404770 10_0 0 San Francisco
1 100 Bryant St at 15th St 37.767100 -122.410662 100_0 0 San Francisco
2 101 San Bruno Ave at 16th St 37.766008 -122.405677 101_0 0 San Francisco
3 101 15th St at Potrero Ave 37.767079 -122.407359 101_1 0 San Francisco
4 101 Potrero Ave at 15th St (Temporary Location) 37.766629 -122.407706 101_1 0 San Francisco
In [13]:
df_station_names.label_name.value_counts()
Out[13]:
San Francisco    172
East Bay         146
San José          56
Name: label_name, dtype: int64
In [14]:
df_station_names[df_station_names.duplicated("new_id", keep = False)] # duplicates for new coordinates which are not > 100 m away from the origin 
Out[14]:
id station_name station_latitude station_longitude new_id label label_name
3 101 15th St at Potrero Ave 37.767079 -122.407359 101_1 0 San Francisco
4 101 Potrero Ave at 15th St (Temporary Location) 37.766629 -122.407706 101_1 0 San Francisco
31 130 22nd St Caltrain Station 37.757718 -122.391813 130_0 0 San Francisco
32 130 22nd St Caltrain Station 37.757288 -122.392051 130_0 0 San Francisco
33 130 22nd St Caltrain Station 37.757369 -122.392057 130_0 0 San Francisco
76 173 Shattuck Ave at 55th Ave 37.840364 -122.264488 173_0 2 East Bay
77 173 Shattuck Ave at 55th St 37.840364 -122.264488 173_0 2 East Bay
111 205 Miles Ave at Cavour St 37.838800 -122.258732 205_1 2 East Bay
113 205 Miles Ave at Cavour St (Temporary Location) 37.838800 -122.258732 205_1 2 East Bay
116 208 William St at 4th St (Temporary Location) 37.329964 -121.881922 208_1 1 San José
118 208 S. 4th St at San Carlos St 37.330040 -121.881821 208_1 1 San José
135 221 E 12th St at 6th Ave 37.794353 -122.253875 221_0 2 East Bay
136 221 6th Ave at E 12th St (Temporary Location) 37.794396 -122.253842 221_0 2 East Bay
137 221 12th St at 6th Ave 37.794353 -122.253875 221_0 2 East Bay
140 224 21st Ave at International Blvd 37.785157 -122.238915 224_0 2 East Bay
141 224 21st Ave at International Blvd 37.784855 -122.239305 224_0 2 East Bay
151 233 E 12th St at 4th Ave 37.795812 -122.255555 233_0 2 East Bay
152 233 4th Ave at E 12th St (Temporary Location) 37.795913 -122.255547 233_0 2 East Bay
153 233 12th St at 4th Ave 37.795812 -122.255555 233_0 2 East Bay
154 234 Fruitvale Ave at International Blvd 37.777682 -122.225771 234_0 2 East Bay
155 234 Farnam St at Fruitvale Ave 37.778058 -122.225400 234_0 2 East Bay
166 244 Shattuck Ave at Hearst Ave 37.873748 -122.268648 244_0 2 East Bay
167 244 Shattuck Ave at Hearst Ave 37.873676 -122.268487 244_0 2 East Bay
168 244 Shattuck Ave at Hearst Ave 37.873792 -122.268618 244_0 2 East Bay
169 245 Downtown Berkeley BART 37.870348 -122.267764 245_0 2 East Bay
170 245 Downtown Berkeley BART 37.870139 -122.268422 245_0 2 East Bay
176 250 North Berkeley BART Station 37.874014 -122.283019 250_0 2 East Bay
177 250 North Berkeley BART Station 37.873558 -122.283093 250_0 2 East Bay
198 272 Woolsey St at Sacramento St 37.850578 -122.278175 272_0 2 East Bay
199 272 Woolsey St at Sacramento St1 37.850578 -122.278175 272_0 2 East Bay
208 280 San Fernando St at 7th St 37.337122 -121.883215 280_0 1 San José
209 280 San Fernando at 7th St 37.337246 -121.883074 280_0 1 San José
210 280 6th St at San Fernando St (Temporary Location) 37.337038 -121.884125 280_0 1 San José
211 281 9th St at San Fernando St 37.338395 -121.880797 281_0 1 San José
212 281 9th St at San Fernando 37.338395 -121.880797 281_0 1 San José
255 321 5th St at Folsom 37.780146 -122.403071 321_0 0 San Francisco
256 321 5th at Folsom 37.780146 -122.403071 321_0 0 San Francisco
270 344 16th St Depot 37.766349 -122.396289 344_0 0 San Francisco
271 344 16th St Depot 37.766349 -122.396292 344_0 0 San Francisco
272 345 Hubbell St at 16th St 37.766483 -122.398279 345_0 0 San Francisco
273 345 Hubbell St at 16th St 37.766474 -122.398295 345_0 0 San Francisco
282 358 Lane St at Van Dyke Ave 37.729252 -122.392378 358_0 0 San Francisco
283 358 Williams Ave at 3rd St 37.729279 -122.392896 358_0 0 San Francisco
290 364 China Basin St at 3rd St 37.772000 -122.389970 364_0 0 San Francisco
291 364 Mission Rock St at 3rd St 37.772886 -122.389940 364_0 0 San Francisco
355 80 Townsend St at 5th St 37.775306 -122.397380 80_0 0 San Francisco
356 80 Townsend St at 5th St 37.775235 -122.397437 80_0 0 San Francisco
In [15]:
df_station_names.drop_duplicates(subset = ["new_id"], inplace = True)
In [16]:
df_station_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 348 entries, 0 to 373
Data columns (total 7 columns):
id                   348 non-null int64
station_name         348 non-null object
station_latitude     348 non-null float64
station_longitude    348 non-null float64
new_id               348 non-null object
label                348 non-null int32
label_name           348 non-null object
dtypes: float64(2), int32(1), int64(1), object(3)
memory usage: 20.4+ KB
In [17]:
df[df.duplicated(subset = ["start_time", "end_time"], keep = False)]
Out[17]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_station_id_new end_station_id_new
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2226999 entries, 0 to 2226998
Data columns (total 18 columns):
duration_sec               int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id           object
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             object
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    object
user_type                  object
member_birth_year          object
member_gender              object
bike_share_for_all_trip    object
start_station_id_new       object
end_station_id_new         object
dtypes: datetime64[ns](2), float64(4), int64(1), object(11)
memory usage: 305.8+ MB

Now we can merge the cluster labels onto the main dataframe via the new station IDs.
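Because the merge below uses how = "outer", keys that exist on only one side survive as rows with NaNs. The indicator flag of pandas' merge makes such mismatches easy to spot; a minimal sketch with toy frames (the column names follow this dataset, the values are made up):

```python
import pandas as pd

trips = pd.DataFrame({"start_station_id_new": ["86_0", "316_0", "999_0"]})
stations = pd.DataFrame({"new_id": ["86_0", "316_0"], "label": [0, 1]})

merged = trips.merge(stations, left_on="start_station_id_new",
                     right_on="new_id", how="outer", indicator=True)

# trips whose station id has no match in the station table show up as "left_only"
print(merged.loc[merged["_merge"] == "left_only", "start_station_id_new"].tolist())
```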

In [19]:
df = df.merge(df_station_names[["new_id", "label"]], left_on = "start_station_id_new", right_on = "new_id", how = "outer")
In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2226999 entries, 0 to 2226998
Data columns (total 20 columns):
duration_sec               int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id           object
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             object
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    object
user_type                  object
member_birth_year          object
member_gender              object
bike_share_for_all_trip    object
start_station_id_new       object
end_station_id_new         object
new_id                     object
label                      int32
dtypes: datetime64[ns](2), float64(4), int32(1), int64(1), object(12)
memory usage: 348.3+ MB
In [21]:
df[df.label.isnull()].start_station_id_new.value_counts()
Out[21]:
Series([], Name: start_station_id_new, dtype: int64)
In [22]:
df.query("start_station_id == '205'").drop_duplicates("start_station_id_new")[["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "start_station_id_new", "label"]]
Out[22]:
start_station_id start_station_name start_station_latitude start_station_longitude start_station_id_new label
1157345 205 Shafter Ave at Cavour St 37.837947 -122.257243 205_0 2
2178748 205 Miles Ave at Cavour St (Temporary Location) 37.838800 -122.258732 205_1 2
In [23]:
df["label_name"] = df["label"].map(mapping)
In [24]:
df.label_name.value_counts()
Out[24]:
San Francisco    1639401
East Bay          485045
San José          102553
Name: label_name, dtype: int64
In [25]:
for x in set(list(df.label)):
    
    df_plot_cluster = df.query(f"label == {x}")

    plt.scatter(df_plot_cluster['start_station_longitude'], df_plot_cluster["start_station_latitude"], s=50, alpha=0.5, label = x);

plt.legend();

Now we will extract several additional features from the member birth year and the start_time timestamp.

In [26]:
df["age"] = df["member_birth_year"].apply(lambda x: 2018 - int(x))
In [27]:
df['month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('M')
In [28]:
df['day_month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('D')
In [29]:
df["dayofweek"] = df["start_time"].apply(lambda x: x.dayofweek)
In [30]:
df["start_hr"] = df["start_time"].apply(lambda x: x.hour)
df["end_hr"] = df["end_time"].apply(lambda x: x.hour)
In [31]:
df_age = df.query("age != 2018 and age < 100").copy()
In [32]:
df_age.age.min()
Out[32]:
17
In [33]:
bins = [x for x in range(10,101, 10)]
df_age["age_bins"] = pd.cut(df_age.age, bins = bins, precision = 0, include_lowest=False)
In [34]:
df_age[["age", "age_bins"]].head()
Out[34]:
age age_bins
2 61 (60, 70]
3 31 (30, 40]
5 54 (50, 60]
6 34 (30, 40]
7 37 (30, 40]
In [35]:
df.head()
Out[35]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... end_station_id_new new_id label label_name age month_year day_month_year dayofweek start_hr end_hr
0 13709 2018-01-01 00:01:53.847 2018-01-01 03:50:23.433 86 Market St at Dolores St 37.769305 -122.426826 119 18th St at Noe St 37.761047 ... 119_0 86_0 0 San Francisco 2018 2018-01 2018-01-01 0 0 3
1 9996 2018-01-01 04:37:08.115 2018-01-01 07:23:44.143 86 Market St at Dolores St 37.769305 -122.426826 223 16th St Mission BART Station 2 37.764765 ... 223_0 86_0 0 San Francisco 2018 2018-01 2018-01-01 0 4 7
2 197 2018-01-01 09:29:57.079 2018-01-01 09:33:14.662 86 Market St at Dolores St 37.769305 -122.426826 85 Church St at Duboce Ave 37.770083 ... 85_0 86_0 0 San Francisco 61 2018-01 2018-01-01 0 9 9
3 183 2018-01-01 11:25:27.634 2018-01-01 11:28:31.059 86 Market St at Dolores St 37.769305 -122.426826 75 Market St at Franklin St 37.773793 ... 75_0 86_0 0 San Francisco 31 2018-01 2018-01-01 0 11 11
4 893 2018-01-01 12:07:39.271 2018-01-01 12:22:32.393 86 Market St at Dolores St 37.769305 -122.426826 42 San Francisco City Hall (Polk St at Grove St) 37.778650 ... 42_0 86_0 0 San Francisco 2018 2018-01 2018-01-01 0 12 12

5 rows × 27 columns

Now we are really ready for the exploratory visualizations.

In [40]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2226999 entries, 0 to 2226998
Data columns (total 27 columns):
duration_sec               int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id           object
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             object
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    object
user_type                  object
member_birth_year          object
member_gender              object
bike_share_for_all_trip    object
start_station_id_new       object
end_station_id_new         object
new_id                     object
label                      int32
label_name                 object
age                        int64
month_year                 period[M]
day_month_year             period[D]
dayofweek                  int64
start_hr                   int64
end_hr                     int64
dtypes: datetime64[ns](2), float64(4), int32(1), int64(5), object(13), period[D](1), period[M](1)
memory usage: 467.2+ MB

Univariate Exploration

There are multiple interesting variables in this dataset. Let's start with the stations.

In [41]:
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/

value_ct = df.start_station_id_new.value_counts().iloc[:25]

fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df, order=value_ct.index, color = color);

plt.ylim(0,70000)

cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)

sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-4000), color = "white")

plt.title("Top 25 Start Stations");
plt.xlabel("Start Station ID");

Overall we can see that 67_0, 15_0 and 58_0 are the most used stations in this dataset. Let's take a look at each group separately.

In [43]:
for x in value_ct.index:
    print(x + " - " + str(df.query(f"start_station_id_new == '{x}'").start_station_name.drop_duplicates().to_numpy()))
67_0 - ['San Francisco Caltrain Station 2  (Townsend St at 4th St)']
15_0 - ['San Francisco Ferry Building (Harry Bridges Plaza)']
58_0 - ['Market St at 10th St']
30_0 - ['San Francisco Caltrain (Townsend St at 4th St)']
81_0 - ['Berry St at 4th St']
21_0 - ['Montgomery St BART Station (Market St at 2nd St)']
6_0 - ['The Embarcadero at Sansome St']
3_0 - ['Powell St BART Station (Market St at 4th St)']
16_0 - ['Steuart St at Market St']
22_0 - ['Howard St at Beale St']
5_0 - ['Powell St BART Station (Market St at 5th St)']
17_0 - ['Embarcadero BART Station (Beale St at Market St)']
66_0 - ['3rd St at Townsend St']
90_0 - ['Townsend St at 7th St']
50_0 - ['2nd St at Townsend St']
27_0 - ['Beale St at Harrison St']
93_0 - ['4th St at Mission Bay Blvd S']
44_0 - ['Civic Center/UN Plaza BART Station (Market St at McAllister St)']
23_0 - ['The Embarcadero at Steuart St']
19_0 - ['Post St at Kearny St']
134_0 - ['Valencia St at 24th St']
126_0 - ['Esprit Park']
24_0 - ['Spear St at Folsom St']
324_0 - ['Union Square (Powell St at Post St)']
20_0 - ['Mechanics Monument Plaza (Market St at Bush St)']
In [44]:
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/

df_new = df.query("label == 0").copy()

value_ct = df_new.start_station_id_new.value_counts().iloc[:25]

fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df_new, order=value_ct.index, color = color);

plt.ylim(0,50000)

cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)

sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-2000), color = "white")

plt.title("Top 25 Start Stations for San Francisco");
plt.xlabel("Start Station ID");

We can see that San Francisco is the area leading the overall trip count.

In [45]:
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/

df_new = df.query("label == 1").copy()

value_ct = df_new.start_station_id_new.value_counts().iloc[:25]

fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df_new, order=value_ct.index, color = color);

plt.ylim(0,50000)

cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)

sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-2000), color = "white")

plt.title("Top 25 Start Stations for San José");
plt.xlabel("Start Station ID");
In [46]:
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/

df_new = df.query("label == 2").copy()

value_ct = df_new.start_station_id_new.value_counts().iloc[:25]

fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df_new, order=value_ct.index, color = color);

plt.ylim(0,50000)

cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)

sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-2000), color = "white")

plt.title("Top 25 Start Stations for East Bay");
plt.xlabel("Start Station ID");

It seems like the Ford GoBike program is relatively new in East Bay and San José: both areas have fewer trips than San Francisco. In East Bay the stations 182_0, 243_0 and 176_0 are popular; for San José the top three are 310_0, 296_0 and 312_0. Now that we know that San Francisco has the most active users, let's take a look at the trip durations.
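As a compact cross-check of the per-cluster bar charts, the top stations per area can be computed in a single groupby chain. This is a minimal sketch on a tiny synthetic frame standing in for `df` (only the `label` and `start_station_id_new` columns are assumed):

```python
import pandas as pd

# synthetic stand-in for df with the two columns used here
demo = pd.DataFrame({
    "label": [0, 0, 0, 1, 1, 2],
    "start_station_id_new": ["67_0", "67_0", "15_0", "310_0", "296_0", "182_0"],
})

# count trips per (cluster, station) and keep the top 2 stations per cluster
top_per_cluster = (
    demo.groupby("label")["start_station_id_new"]
        .value_counts()          # sorted descending within each cluster
        .groupby(level=0)
        .head(2)
)
print(top_per_cluster)
```

On the full dataset this returns the same leaders as the bar charts, without drawing one figure per cluster.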

In [47]:
#maybe a customer forgot to log off
bin_size = 100
bins = np.arange(0,df.duration_sec.max()+bin_size,bin_size)

fig, axes = plt.subplots(figsize = (12,5), dpi = 110)

plt.hist(df.duration_sec, bins = bins, color= color);
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.xlim(-100,6000);
plt.tight_layout()

This graph is limited to 6000 seconds to exclude the 'outliers'. The main takeaway is that most trips have a duration between 250 and 550 seconds.
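The modal range read off the histogram can also be obtained numerically by binning the durations with `pd.cut`; a minimal sketch on synthetic durations standing in for `df.duration_sec`:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for df.duration_sec
durations = pd.Series([300, 320, 350, 480, 510, 900])

bin_size = 100
bins = np.arange(0, durations.max() + bin_size, bin_size)

# count trips per 100-second bin, most frequent bin first
binned = pd.cut(durations, bins=bins).value_counts().sort_values(ascending=False)
print(binned.head(3))
```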

In [49]:
df.query("duration_sec < 6000").duration_sec.mean()
Out[49]:
694.5217216880001
In [338]:
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
for x in mapping.values():
    df_new = df.query(f"label_name == '{x}'")

    bin_size = 100
    bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)   

    plt.hist(df_new.duration_sec, bins = bins, label = x, histtype='step');
    
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()

The distributions look similar across the three areas (all right-skewed), although trips in East Bay seem to be a little shorter on average.

In [58]:
for x in mapping.values():
    print(x, df.query(f"label_name == '{x}' and duration_sec < 6000").duration_sec.mean())
San Francisco 717.4930727719006
San José 627.4785481464028
East Bay 631.2436195665932

The mean durations agree with that.
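The same comparison can be written without the loop, as one groupby over the area label. A sketch on a synthetic frame whose `label_name`/`duration_sec` columns mirror the ones used above:

```python
import pandas as pd

# synthetic stand-in for df
demo = pd.DataFrame({
    "label_name": ["San Francisco", "San Francisco", "East Bay", "San José"],
    "duration_sec": [700, 740, 620, 630],
})

# mean trip duration per area, below the 6000 s cutoff used above
mean_by_area = (
    demo.query("duration_sec < 6000")
        .groupby("label_name")["duration_sec"]
        .mean()
)
print(mean_by_area)
```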

In [59]:
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
df_new = df.query(f"label_name == 'San Francisco'")

bin_size = 100
bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)   

plt.hist(df_new.duration_sec, bins = bins, label = "San Francisco", histtype='step', color = "g");
    
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()
In [54]:
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
df_new = df.query(f"label_name == 'East Bay'")

bin_size = 100
bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)   

plt.hist(df_new.duration_sec, bins = bins, label = "East Bay", histtype='step');
    
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()
In [60]:
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
df_new = df.query(f"label_name == 'San José'")

bin_size = 100
bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)   

plt.hist(df_new.duration_sec, bins = bins, label = "San José", histtype='step', color = "orange");
    
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()

Now we will look at the user structure.

In [340]:
fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "member_gender", data = df,  order=df.member_gender.value_counts().index, palette = "viridis");

cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.303, p.get_height()+50000))

plt.title("Users By Gender");
plt.xlabel("");

It seems that far more men use this service than women, "not defined", or other.

In [61]:
fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "label_name", data = df,  order=df.label_name.value_counts().index, palette = "viridis", hue = "member_gender");

cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)

#for p in ax.patches:
 #   ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.01, p.get_height()+50000))

plt.title("Users By Gender");
plt.xlabel("");
In [62]:
for x in mapping.values():
    print(x, df.query(f"label_name == '{x}'").member_gender.value_counts() / df.query(f"label_name == '{x}'").member_gender.count())
San Francisco Male           0.703946
Female         0.220110
not defined    0.063101
Other          0.012844
Name: member_gender, dtype: float64
San José Male           0.773971
Female         0.182842
not defined    0.040340
Other          0.002847
Name: member_gender, dtype: float64
East Bay Male           0.642602
Female         0.288779
not defined    0.042120
Other          0.026499
Name: member_gender, dtype: float64

Looking at the plot and the relative frequencies, the male share exceeds 50% in all three areas.
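The loop above can also be replaced with a single `pd.crosstab` call using `normalize='index'`, which yields the gender share per area directly. A sketch on a synthetic frame:

```python
import pandas as pd

# synthetic stand-in for df
demo = pd.DataFrame({
    "label_name": ["SF", "SF", "SF", "EB"],
    "member_gender": ["Male", "Male", "Female", "Male"],
})

# row-normalized counts: each row sums to 1
shares = pd.crosstab(demo.label_name, demo.member_gender, normalize="index")
print(shares)
```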

In [342]:
value_ct = df.user_type.value_counts().iloc[:31]

fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "user_type", data = df, order=value_ct.index, palette = "viridis");

cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.31, p.get_height()+40000))

plt.title("Users By Type");
plt.xlabel("");
In [343]:
value_ct = df.label_name.value_counts().iloc[:31]

fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "label_name", data = df, order=value_ct.index, palette = "viridis", hue = "user_type");

cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()+40000))

plt.title("Users By Type");
plt.xlabel("");

There are also far more trips by subscribers than by one-off customers.

In [67]:
fig, ax = plt.subplots(figsize = (20,5), dpi = 80)
color = sns.color_palette("viridis")[2]
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73").sort_values("age"), color = color);
In [345]:
df.query("age != 2018").age.mean()
Out[345]:
34.69976603784374

Overall, the bulk of users is between 24 and 35 years old, with no user younger than 18 (among those who provided their birth year).
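The minimum-age claim can be checked directly once the placeholder value is filtered out (the value 2018 apparently stems from rows without a usable birth year; synthetic ages stand in for `df.age`):

```python
import pandas as pd

# synthetic stand-in for df.age; 2018 marks a missing birth year (assumption)
ages = pd.Series([2018, 24, 31, 35, 2018, 61])

valid = ages[ages != 2018]
print(valid.min(), valid.mean())
```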

In [401]:
fig, ax = plt.subplots(figsize = (20,5), dpi = 80)
color = sns.color_palette("viridis")[2]
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73 and label == 2").sort_values("age"), color = 'orange', label = "San Francisco");
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73 and label == 0").sort_values("age"), color = 'g', label = "East Bay");
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73 and label == 1").sort_values("age"), color = 'b', label = "San José");
plt.legend();

The East Bay age distribution is broader than San Francisco's. San José has the youngest users on average.

In [69]:
for x in mapping.values():
    print(x, df.query(f"label_name == '{x}' and age != 2018").age.mean())
San Francisco 35.157000623209576
San José 29.72252479271663
East Bay 34.2428526844807

For the next plots we will focus on the time components of the data. First we will explore on which days people like to go on trips; 0 refers to Monday and 6 to Sunday.

In [70]:
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "dayofweek", data = df, palette = "viridis");

Based on this graph it looks like users ride the bikes more during the week than on the weekend.

In [348]:
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "label_name", data = df, palette = "viridis", hue = "dayofweek");
In [81]:
#https://stackoverflow.com/questions/43855474/changing-sort-in-value-counts

for x in mapping.values():
    print(f"{x}\n", df.query(f"label_name == '{x}'").dayofweek.value_counts().sort_index() / df.query(f"label_name == '{x}'").dayofweek.count(), "\n")
San Francisco
 0    0.157500
1    0.172698
2    0.172733
3    0.175843
4    0.157807
5    0.087734
6    0.075684
Name: dayofweek, dtype: float64 

San José
 0    0.154466
1    0.161126
2    0.164656
3    0.161204
4    0.143360
5    0.110401
6    0.104785
Name: dayofweek, dtype: float64 

East Bay
 0    0.153050
1    0.166949
2    0.164572
3    0.167048
4    0.150541
5    0.105875
6    0.091965
Name: dayofweek, dtype: float64 

This applies to all three areas: Tuesday, Wednesday and Thursday are the most active days, followed by Monday and Friday, and then Saturday and Sunday. Now let's look at the trips per month and year.

In [350]:
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "month_year", data = df, palette = "viridis");
In [351]:
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "month_year", data = df, palette = "viridis", hue = "label_name");

For all three areas we can see that bike sharing grew in popularity from January to October 2018, followed by a drop during November and December 2018.

In [86]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = "label_name", data = df, palette = "viridis", hue = "month_year");
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Out[86]:
<matplotlib.legend.Legend at 0x1b27e4e6160>
In [89]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = "month_year", data = df.query("label_name == 'San José'"), palette = "viridis");
In [90]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = "month_year", data = df.query("label_name == 'East Bay'"), palette = "viridis");

The drop in the San José area lasted one month longer, until February 2019.

In [352]:
#https://seaborn.pydata.org/generated/seaborn.color_palette.html
sns.palplot(sns.color_palette("viridis"))
In [353]:
#https://stackoverflow.com/questions/35143672/seaborn-conditional-colors-based-on-value

custom_palette = {}
for q in set(df.day_month_year):
    if q.dayofweek == 0:
        custom_palette[q] = sns.color_palette("viridis")[0]
    elif q.dayofweek == 1:
        custom_palette[q] = sns.color_palette("viridis")[1]
    elif q.dayofweek == 2:
        custom_palette[q] = sns.color_palette("viridis")[2]
    elif q.dayofweek == 3:
        custom_palette[q] = sns.color_palette("viridis")[3]
    elif q.dayofweek == 4:
        custom_palette[q] = sns.color_palette("viridis")[4]
    elif q.dayofweek == 5:
        custom_palette[q] = sns.color_palette("viridis")[5]
    elif q.dayofweek == 6:
        custom_palette[q] = (224/255, 228/255, 65/255)
    else:
        custom_palette[q] = 'g'
        
legend_obj = []

colors = [sns.color_palette("viridis")[0],
          sns.color_palette("viridis")[1],
          sns.color_palette("viridis")[2],
          sns.color_palette("viridis")[3],
          sns.color_palette("viridis")[4],
          sns.color_palette("viridis")[5],
          (224/255, 228/255, 65/255)]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

for i, s in enumerate(days):
    legend_obj.append(plt.scatter([],[],color = colors[i]));  
In [354]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
def plot_data_time(start_date="12.31.17", end_date="03.01.18"):
    fig, ax = plt.subplots(figsize = (30,5), dpi = 100)
    sns.countplot(x = "day_month_year", data = df[(df["start_time"] < pd.to_datetime(end_date)) & (df["start_time"] > pd.to_datetime(start_date))], palette = custom_palette);
    plt.xticks(rotation = 90);
    plt.ylim(0,10000);
    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
    # Put a legend to the right of the current axis
    ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5))
In [355]:
plot_data_time()
In [356]:
plot_data_time("03.01.18","05.01.18")
In [357]:
plot_data_time("05.01.18","07.01.18")
In [358]:
plot_data_time("07.01.18","09.01.18")
In [359]:
plot_data_time("09.01.18","11.01.18")
In [360]:
plot_data_time("11.01.18","01.01.19")
In [361]:
plot_data_time("01.01.19","03.01.19")

These plots also show the increasing number of trips per day and confirm the trend that there are fewer trips on weekends than on weekdays. We can also see drops in the trip count, for example on 23 Nov 2018 and 6 Jan 2019.

In [362]:
fig, ax = plt.subplots(figsize = (16,5))

sns.countplot(x = "start_hr", data = df, palette = "viridis", ax = ax);

The most frequent starting hours are 8 and 17. People may be riding before and after work, which would make sense given the many working-age subscribers in the dataset: you only subscribe to a service you want to use regularly, so integration into working or student life is plausible.
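One way to probe the commuting hypothesis would be to compare, per user type, the share of trips starting in the peak hours 8 and 17. A minimal sketch on a synthetic frame (the `user_type`/`start_hr` columns mirror the dataset):

```python
import pandas as pd

# synthetic stand-in for df
demo = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer"],
    "start_hr": [8, 17, 17, 13],
})

# share of each user type's trips that start at 8 or 17
peak_share = (
    demo.groupby("user_type")["start_hr"]
        .apply(lambda s: s.isin([8, 17]).mean())
)
print(peak_share)
```

If the peaks really come from commuting, subscribers should show a markedly higher peak-hour share than customers.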

In [363]:
fig, ax = plt.subplots(figsize = (16,5))

sns.countplot(x = "start_hr", data = df.query("label == 0"), palette = "viridis", ax = ax);        
In [364]:
fig, ax = plt.subplots(figsize = (16,5))

sns.countplot(x = "start_hr", data = df.query("label == 1"), palette = "viridis", ax = ax);

For San José the peak shifts more towards hour 17.

In [365]:
fig, ax = plt.subplots(figsize = (16,5))

sns.countplot(x = "start_hr", data = df.query("label == 2"), palette = "viridis", ax = ax);

As we saw, the most frequent hours are 8 and 17. To see how this trend changes over time, the next plots connect these two counts with a line so the trend can be read off the slope.

In [102]:
fig, ax = plt.subplots(14,1,figsize = (16,50))

dates = [pd.to_datetime(f"2018-{x}") for x in range(1,13)] + [pd.to_datetime(f"2019-{x}")for x in range(1,3)]

for i, date in enumerate(dates):
    try:
        sns.countplot(x = "start_hr", data = df[(df['start_time'] < dates[i+1])&(df['start_time'] >= date)], palette = "viridis", ax = ax[i]);
        ax[i].set_ylim(0,25000)
        ax[i].set_xlim(-0.5,23.5)
        ax[i].plot([8,17], [ax[i].patches[8].get_height(),ax[i].patches[17].get_height()], 'o-', color = "#3a3a3a")
        ax[i].text(0, 22000, str(date.date()))
        
    except IndexError:  # the last date has no upper bound
        sns.countplot(x = "start_hr", data = df[(df['start_time'] > dates[i])], palette = "viridis", ax = ax[i]);
        ax[i].set_ylim(0,25000)
        ax[i].set_xlim(-0.5,23.5)
        ax[i].plot([8,17], [ax[i].patches[8].get_height(),ax[i].patches[17].get_height()], 'o-', color = "#3a3a3a")
        ax[i].text(0, 22000, str(date.date()))

We can see that this is an ongoing trend: for most months, slightly more trips start at 17:00 than at 8:00.
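The monthly 8-vs-17 counts can also be tabulated in one chain by grouping on a monthly period and unstacking the hour. A sketch on synthetic timestamps standing in for `df.start_time`/`df.start_hr`:

```python
import pandas as pd

# synthetic stand-in for df
demo = pd.DataFrame({
    "start_time": pd.to_datetime(["2018-01-05 08:10", "2018-01-05 17:20",
                                  "2018-02-01 17:05", "2018-02-02 17:40"]),
    "start_hr": [8, 17, 17, 17],
})

# keep only the two peak hours, count per month, one column per hour
peaks = demo[demo.start_hr.isin([8, 17])].copy()
peaks["month"] = peaks.start_time.dt.to_period("M")
table = peaks.groupby(["month", "start_hr"]).size().unstack(fill_value=0)
print(table)
```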

In [367]:
for i, date in enumerate(dates):
    try:
        print(str(df[(df['start_time'] < dates[i+1])&(df['start_time'] >= date)].start_hr.value_counts().head(2)))
    except IndexError:
        pass
8     11480
17    11103
Name: start_hr, dtype: int64
8     12738
17    12532
Name: start_hr, dtype: int64
17    13864
8     12460
Name: start_hr, dtype: int64
17    15889
8     14596
Name: start_hr, dtype: int64
17    21765
8     19261
Name: start_hr, dtype: int64
17    22980
8     20665
Name: start_hr, dtype: int64
17    22866
8     20994
Name: start_hr, dtype: int64
17    23025
8     21762
Name: start_hr, dtype: int64
17    21836
8     19420
Name: start_hr, dtype: int64
17    24098
8     23104
Name: start_hr, dtype: int64
8     15999
17    15022
Name: start_hr, dtype: int64
17    13882
8     13602
Name: start_hr, dtype: int64
17    23316
8     21660
Name: start_hr, dtype: int64

Bivariate Exploration

Now we are going to dig deeper into the data searching for relationships and trends between variables.

In [368]:
df.head()
Out[368]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... end_station_id_new new_id label age month_year day_month_year dayofweek start_hr end_hr label_name
0 13709 2018-01-01 00:01:53.847 2018-01-01 03:50:23.433 86 Market St at Dolores St 37.769305 -122.426826 119 18th St at Noe St 37.761047 ... 119_0 86_0 2 2018 2018-01 2018-01-01 0 0 3 San Francisco
1 9996 2018-01-01 04:37:08.115 2018-01-01 07:23:44.143 86 Market St at Dolores St 37.769305 -122.426826 223 16th St Mission BART Station 2 37.764765 ... 223_0 86_0 2 2018 2018-01 2018-01-01 0 4 7 San Francisco
2 197 2018-01-01 09:29:57.079 2018-01-01 09:33:14.662 86 Market St at Dolores St 37.769305 -122.426826 85 Church St at Duboce Ave 37.770083 ... 85_0 86_0 2 61 2018-01 2018-01-01 0 9 9 San Francisco
3 183 2018-01-01 11:25:27.634 2018-01-01 11:28:31.059 86 Market St at Dolores St 37.769305 -122.426826 75 Market St at Franklin St 37.773793 ... 75_0 86_0 2 31 2018-01 2018-01-01 0 11 11 San Francisco
4 893 2018-01-01 12:07:39.271 2018-01-01 12:22:32.393 86 Market St at Dolores St 37.769305 -122.426826 42 San Francisco City Hall (Polk St at Grove St) 37.778650 ... 42_0 86_0 2 2018 2018-01 2018-01-01 0 12 12 San Francisco

5 rows × 27 columns

In [369]:
#join the trips together
df["combi"] = df["start_station_id_new"] + " - " + df["end_station_id_new"]
df.combi.value_counts().head(10)
Out[369]:
15_0 - 6_0       8352
6_0 - 16_0       5403
81_0 - 15_0      5238
196_0 - 182_0    4607
182_0 - 196_0    4355
6_0 - 15_0       4171
16_0 - 6_0       3685
15_0 - 81_0      3620
90_0 - 67_0      3417
195_0 - 182_0    3396
Name: combi, dtype: int64

One interesting insight is the set of most frequent routes. Based on the output above we can see that station 6_0 appears often. A heatmap should make this visible.

In [370]:
combi_list = df.combi.value_counts().head(15).keys()

df_criteria = df[df["combi"].isin(combi_list)]

df_pivot = df_criteria.pivot_table(index="start_station_id_new", columns = "end_station_id_new", values = "start_time", aggfunc = "count", fill_value = 0)

fig, axes = plt.subplots(figsize = (15,10), dpi = 70)
sns.heatmap(df_pivot, annot = True, cmap = "viridis_r", fmt='g', vmin = 1, vmax = 6000, mask= df_pivot == 0)
Out[370]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f7eb44128>

Here we can see the top 15 routes.

In [371]:
for i, x in enumerate(combi_list):
    start, end = x.split(" - ")
    start_name = df.query(f"start_station_id_new == '{start}'").start_station_name.drop_duplicates().to_numpy()
    end_name = df.query(f"end_station_id_new == '{end}'").end_station_name.drop_duplicates().to_numpy()
    print(start," ",start_name[0], " - ",end, " ", end_name[0])
15_0   San Francisco Ferry Building (Harry Bridges Plaza)  -  6_0   The Embarcadero at Sansome St
6_0   The Embarcadero at Sansome St  -  16_0   Steuart St at Market St
81_0   Berry St at 4th St  -  15_0   San Francisco Ferry Building (Harry Bridges Plaza)
196_0   Grand Ave at Perkins St  -  182_0   19th Street BART Station
182_0   19th Street BART Station  -  196_0   Grand Ave at Perkins St
6_0   The Embarcadero at Sansome St  -  15_0   San Francisco Ferry Building (Harry Bridges Plaza)
16_0   Steuart St at Market St  -  6_0   The Embarcadero at Sansome St
15_0   San Francisco Ferry Building (Harry Bridges Plaza)  -  81_0   Berry St at 4th St
90_0   Townsend St at 7th St  -  67_0   San Francisco Caltrain Station 2  (Townsend St at 4th St)
195_0   Bay Pl at Vernon St  -  182_0   19th Street BART Station
58_0   Market St at 10th St  -  67_0   San Francisco Caltrain Station 2  (Townsend St at 4th St)
50_0   2nd St at Townsend St  -  15_0   San Francisco Ferry Building (Harry Bridges Plaza)
58_0   Market St at 10th St  -  3_0   Powell St BART Station (Market St at 4th St)
16_0   Steuart St at Market St  -  81_0   Berry St at 4th St
17_0   Embarcadero BART Station (Beale St at Market St)  -  27_0   Beale St at Harrison St
In [372]:
combi_list = df.combi.value_counts().sample(20).keys()

df_criteria = df[df["combi"].isin(combi_list)]

df_pivot = df_criteria.pivot_table(index="start_station_id_new", columns = "end_station_id_new", values = "start_time", aggfunc = "count", fill_value = 0)

fig, axes = plt.subplots(figsize = (15,10), dpi = 80)
sns.heatmap(df_pivot, annot = True, cmap = "viridis_r", fmt='g', vmin = 1, vmax = 6000, mask= df_pivot == 0)
Out[372]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f7ee62358>
In [373]:
df.head()
Out[373]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... new_id label age month_year day_month_year dayofweek start_hr end_hr label_name combi
0 13709 2018-01-01 00:01:53.847 2018-01-01 03:50:23.433 86 Market St at Dolores St 37.769305 -122.426826 119 18th St at Noe St 37.761047 ... 86_0 2 2018 2018-01 2018-01-01 0 0 3 San Francisco 86_0 - 119_0
1 9996 2018-01-01 04:37:08.115 2018-01-01 07:23:44.143 86 Market St at Dolores St 37.769305 -122.426826 223 16th St Mission BART Station 2 37.764765 ... 86_0 2 2018 2018-01 2018-01-01 0 4 7 San Francisco 86_0 - 223_0
2 197 2018-01-01 09:29:57.079 2018-01-01 09:33:14.662 86 Market St at Dolores St 37.769305 -122.426826 85 Church St at Duboce Ave 37.770083 ... 86_0 2 61 2018-01 2018-01-01 0 9 9 San Francisco 86_0 - 85_0
3 183 2018-01-01 11:25:27.634 2018-01-01 11:28:31.059 86 Market St at Dolores St 37.769305 -122.426826 75 Market St at Franklin St 37.773793 ... 86_0 2 31 2018-01 2018-01-01 0 11 11 San Francisco 86_0 - 75_0
4 893 2018-01-01 12:07:39.271 2018-01-01 12:22:32.393 86 Market St at Dolores St 37.769305 -122.426826 42 San Francisco City Hall (Polk St at Grove St) 37.778650 ... 86_0 2 2018 2018-01 2018-01-01 0 12 12 San Francisco 86_0 - 42_0

5 rows × 28 columns

In [374]:
#creating the legend object for the next plot
legend_obj = []

colors = [sns.color_palette("viridis")[0],
          sns.color_palette("viridis")[1],
          sns.color_palette("viridis")[2],
          sns.color_palette("viridis")[3],
          sns.color_palette("viridis")[4],
          sns.color_palette("viridis")[5],
          (163/255, 199/255, 70/255)]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

for i, s in enumerate(days):
    legend_obj.append(plt.scatter([],[],color = colors[i]));  

Another interesting combination is the day of the week together with the average duration. For this we create a groupby object (mean) over each month-year combination. A box plot is appropriate to visualize this relationship.

In [375]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot

fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "duration_sec", data = df.groupby(["dayofweek", "month_year"], as_index = False).mean(), palette = "viridis")

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5))
Out[375]:
<matplotlib.legend.Legend at 0x22f3f2a19e8>

We can clearly see that during the week users take shorter trips (~780 seconds), while the average duration on the weekend rises to ~1200 seconds.
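The weekday/weekend gap can also be computed directly by mapping `dayofweek` (Monday = 0) to a weekend flag. A sketch on a synthetic frame:

```python
import pandas as pd

# synthetic stand-in for df
demo = pd.DataFrame({
    "dayofweek": [0, 2, 4, 5, 6],      # Monday = 0 ... Sunday = 6
    "duration_sec": [700, 760, 780, 1150, 1250],
})

demo["is_weekend"] = demo.dayofweek >= 5
weekend_means = demo.groupby("is_weekend")["duration_sec"].mean()
print(weekend_means)
```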

In [376]:
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot

fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "duration_sec", data = df.groupby(["dayofweek", "month_year", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])

This trend holds for all areas. We can also see that San Francisco users have, on average, the longest trips, followed by East Bay and then San José. But what about the average duration by age?

In [377]:
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "age_bins", y = "duration_sec", data = df_age.groupby(["age_bins", "month_year"], as_index = False).mean(), palette = "viridis")
plt.ylim(250,1500)
Out[377]:
(250, 1500)

The longest durations are found in the 10 to 20 years bin. After this bin the average duration drops, then it increases slightly up to the 60 to 70 bin. Next we explore whether the duration also differs by starting hour.
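`df_age` and its `age_bins` column are built earlier in the notebook; a minimal reconstruction of such decade bins with `pd.cut` might look like this (synthetic ages, assumed bin edges, for illustration only):

```python
import pandas as pd

# synthetic stand-in for the cleaned age column
ages = pd.Series([18, 24, 35, 47, 63])

# decade bins from 10 to 80 (assumed edges)
age_bins = pd.cut(ages, bins=range(10, 81, 10))
vc = age_bins.value_counts()
print(vc.sort_index())
```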

In [ ]:
df_age.to_csv("df_age.csv")
In [378]:
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "start_hr", y = "duration_sec", data = df.groupby(["start_hr", "month_year"], as_index = False).mean(), palette = "viridis")
Out[378]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f31ec1048>

In the hours 0 to 3 we may be dealing with outliers. On the other hand, we saw earlier that few trips start at that time, so a single long trip has a stronger impact than at, say, 8 o'clock. From 5 to 9 trips are relatively short at ~600 seconds, then the average rises to ~950 seconds from 10 to 15. From 16 to 20 it drops again to ~700 seconds, before increasing slightly around 22 and 23.

In [379]:
fig, ax = plt.subplots(figsize = (20,10), dpi = 80)
sns.boxplot(x = "start_hr", y = "duration_sec", data = df.groupby(["start_hr", "month_year", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
Out[379]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f3ea232b0>

Split by area, the hours 0 - 4 become even noisier. This graph confirms again that San Francisco has, on average, the longest trips, while users in East Bay and San José tend to take shorter trips.

In [380]:
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "start_hr", data = df.groupby(["dayofweek", "month_year"], as_index = False).mean(), palette = "viridis")

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5)) #misleading
Out[380]:
<matplotlib.legend.Legend at 0x22f4d344ba8>

This graph shows that people start their trips later on the weekend, on average, than during the week.

In [381]:
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "start_hr", data = df.groupby(["dayofweek", "month_year", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
Out[381]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f7ef7ecc0>

Looking at each area is interesting: users from East Bay and San José not only have shorter trip durations on average, they also start their trips later than users in San Francisco.

In [382]:
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "month_year", y = "start_hr", data = df.groupby(["month_year","dayofweek"], as_index = False).mean(), palette = "viridis")
Out[382]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f7ebdcf60>

This shows that, on average, users start later in the middle of the year than at the beginning or end of the year.

In [383]:
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "month_year", y = "start_hr", data = df.groupby(["month_year","dayofweek", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
Out[383]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f97f7f6a0>

Multivariate Exploration

For the last exploratory visualization we will take a look at the trips with kepler.gl.

In [384]:
df_station_names.head()
Out[384]:
id station_name station_latitude station_longitude new_id label label_name
0 10 Washington St at Kearny St 37.795393 -122.404770 10_0 2 San Francisco
1 100 Bryant St at 15th St 37.767100 -122.410662 100_0 2 San Francisco
2 101 San Bruno Ave at 16th St 37.766008 -122.405677 101_0 2 San Francisco
3 101 15th St at Potrero Ave 37.767079 -122.407359 101_1 2 San Francisco
5 102 Irwin St at 8th St 37.766883 -122.399579 102_0 2 San Francisco
In [385]:
df_trips = df[["start_station_latitude", "start_station_longitude",  "end_station_latitude", "end_station_longitude","start_station_id_new", "end_station_id_new"]].copy()
In [386]:
df_trips["cnt"] = 1
In [387]:
df_trips_grp = df_trips.groupby([x for x in df_trips.columns[:-1]], as_index = False).sum().sort_values("cnt", ascending = False)
In [388]:
df_trips_grp.head()
Out[388]:
start_station_latitude start_station_longitude end_station_latitude end_station_longitude start_station_id_new end_station_id_new cnt
25841 37.795392 -122.394203 37.804770 -122.403234 15_0 6_0 8352
28155 37.804770 -122.403234 37.794130 -122.394430 6_0 16_0 5403
16138 37.775880 -122.393170 37.795392 -122.394203 81_0 15_0 5238
29022 37.808894 -122.256460 37.809013 -122.268247 196_0 182_0 4607
29136 37.809013 -122.268247 37.808894 -122.256460 182_0 196_0 4355
In [389]:
df_trips_grp.to_csv("grps.csv", index = False)
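The helper column `cnt = 1` followed by `sum()` works, but the same route counts can be obtained more directly with `groupby(...).size()`. A small sketch on toy data (the station ids here are made up, only the column names follow the notebook):

```python
import pandas as pd

#hypothetical trips between made-up stations
trips = pd.DataFrame({
    "start_station_id_new": ["15_0", "15_0", "6_0", "15_0"],
    "end_station_id_new":   ["6_0",  "6_0",  "16_0", "6_0"],
})

#one row per route with its trip count, highest first
route_counts = (
    trips.groupby(["start_station_id_new", "end_station_id_new"])
         .size()
         .reset_index(name="cnt")
         .sort_values("cnt", ascending=False)
)
print(route_counts)
```

`size()` counts rows per group directly, so no dummy column is needed and the intent is clearer.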

First we will look at San Francisco. To get some insight, the visualization will only contain routes with more than 1000 trips: San Francisco Trips with more than 1000 trips
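The per-area thresholds used below (1000 trips for San Francisco, 500 for East Bay, 200 for San José) boil down to a simple filter on the grouped routes before exporting them to kepler.gl. A sketch on toy data (assuming a `cnt` column like the one built above):

```python
import pandas as pd

#toy grouped routes with invented trip counts
grp = pd.DataFrame({"route": ["a", "b", "c"], "cnt": [1500, 800, 120]})

#keep only routes above an area-specific threshold, e.g. 1000 for San Francisco
sf_routes = grp[grp["cnt"] > 1000]
print(sf_routes)
```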

We can see that most of the trips are close to the waterfront. Now for East Bay, with routes with more than 500 trips: East Bay Trips with more than 500 trips

Here the main routes are much more spread out than in San Francisco. It also looks like people use this service to quickly cover smaller distances. For San José we will take a look at routes that have more than 200 trips. San Jose Trips with more than 200 trips

For San José the trips look spread across most of the stations.

Summary and Conclusions

From our exploration we can note that:

  • This data covers three areas: San Francisco, East Bay and San José
  • The highest count of trips can be found in San Francisco, followed by East Bay and San José
  • The average trip is ~700 seconds long; most trips were around ~500 seconds long
  • San Francisco has the longest trips with ~717 seconds on average, followed by East Bay with ~631 seconds and San José with ~627 seconds
  • There are a lot more male users than other users; in every area over 64% of users are male
  • There are a lot more subscribers than customers using this service
  • The average user over all data is most likely between 24 and 35 years old
  • San Francisco has the oldest average users, followed by East Bay and San José, which has the youngest users with an average of ~30 years
  • People use the bikes more (higher trip counts) during the week than on the weekend
  • On the other hand, trips on the weekend are longer (~1100 seconds) than during the week (~780 seconds)
  • People also start their trips later on weekends than during the week (~14:00 instead of ~13:20)
  • Overall San Francisco has the longest average trips, followed by East Bay and San José
  • Users starting in the morning and in the evening have shorter trips than people who start between 11:00 and 15:00
  • In San José the trips start latest on average, followed by East Bay and San Francisco
  • There is an increasing usage trend over time
  • People start their trips most frequently at 8:00 and 17:00
  • The most popular trip so far is between stations 15_0 (San Francisco Ferry Building, Harry Bridges Plaza) and 6_0 (The Embarcadero at Sansome St)